DOCUMENT RESUME



ED 172 523 FL 010 176 

AUTHOR Clark, John L. D., Ed.

TITLE Direct Testing of Speaking Proficiency: Theory and
Application.

SPONS AGENCY Office of Education (DHEW), Washington, D.C.

Proceedings of a conference conducted by
Educational Testing Service in cooperation with the
Georgetown University Round Table on Languages and
Linguistics (Washington, D.C., March 1978)
AVAILABLE FROM Educational Testing Service, Princeton, New Jersey

0 d tn ( f - - ) 

DESCRIPTORS Achievement Tests; *Communicative Competence
(Languages); Conference Reports; Higher Education;
Language Fluency; Language Instruction; *Language
Proficiency; *Language Skills; *Language Tests;
Linguistic Competence; Secondary Education; Second
Language Learning; Speech Communication; Test
Construction; Testing; Test Validity

I DFN]- : Ihw.i *t)r:>i T^^^tiiiq 

ABSTRACT

The following papers are presented in the conference
proceedings: (1) "Development and Current Use of the FSI Oral
Interview Test," by H. Sollenberger; (2) "Interview Testing in
Non-European Languages," by W. Lovelace; (3) "Measuring Second
Language Speaking Ability in New Brunswick's Senior High Schools," by
M. Albert; (4) "Using the FSI Interview as a Diagnostic Evaluation
Instrument," by S. Graham; (5) "Direct Testing of Speaking Skills in
a Criterion-Referenced Mode," by R. Franco; (6) "Oral Proficiency
Testing in New Jersey Bilingual and English as a Second Language
Teacher Certification," by R. Brown; (7) "Adaptation of the FSI
Interview Scale for Secondary Schools and Colleges," by C. Reschke;
(8) "Interview Techniques and Scoring Criteria at the Higher
Proficiency Levels," by R. Jones; (9) "Testing Speaking Proficiency
through Functional Dialogues," by I. Roos-Wijgh; (10) "Scope and
Limitations of Interview-Based Language Testing: Are We Asking Too
Much of the Interview?" by R. Lado; (11) "Measuring Foreign Language
Speaking Proficiency: A Study of Agreement Among Raters," by
M. Adams; (12) "Independent Rating in Oral Proficiency Interviews," by
J. Quinones; (13) "Third Rating of FSI Interviews," by P. Lowe, Jr.;
(14) "Determining the Effect of Uncontrolled Sources of Error in a
Direct Test of Oral Proficiency and the Capability of the Procedure
to Detect Improvement Following Classroom Instruction," by K. Mullen;
(15) "Reliability and Validity of Language Aspects Contributing to
Oral Proficiency of Prospective Teachers of German," by R. Clifford;
(16) "Interview Testing Research at Educational Testing Service," by
J. Clark; (17) "Psychophysical Scaling of the Language Proficiency
Interview: A Preliminary Report," by R. Vincent; and (18) "Setting
Standards of Speaking Proficiency," by S. Livingston.

ERIC 



DIRECT TESTING OF SPEAKING PROFICIENCY: 
THEORY AND APPLICATION 



Proceedings of a Two-Day Conference Conducted
by Educational Testing Service in Cooperation 
with the U.S. Interagency Language Round Table 
and the Georgetown University Round Table on 
Languages and Linguistics 



Educational Testing Service, Princeton, NJ 

1978 



John L. D. Clark, ed. 




THIS DOCUMENT HAS BEEN REPRO-
DUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGIN-
ATING IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRE-
SENT OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION




The work reported herein was conducted under grant No.
G00-77-01871 from the U.S. Office of Education,
Department of Health, Education and Welfare, under the 
authority of Title VI, Section 602, NDEA. 



CONTENTS 



Preface

Development and Current Use of the FSI Oral Interview Test
    Howard E. Sollenberger

Interview Testing in Non-European Languages
    William Lovelace

Measuring Second Language Speaking Ability in New Brunswick's
Senior High Schools
    Murielle Albert

Using the FSI Interview as a Diagnostic Evaluation Instrument
    Stephen L. Graham

Direct Testing of Speaking Skills in a Criterion-Referenced Mode
    Robert B. Franco

Oral Proficiency Testing in New Jersey Bilingual and English as a
Second Language Teacher Certification
    Richard W. Brown

Adaptation of the FSI Interview Scale for Secondary Schools and
Colleges
    Claus Reschke

Interview Techniques and Scoring Criteria at the Higher
Proficiency Levels
    Randall L. Jones

Testing Speaking Proficiency through Functional Dialogues
    I. F. Roos-Wijgh

Scope and Limitations of Interview-Based Language Testing: Are We
Asking Too Much of the Interview?
    Robert Lado

Measuring Foreign Language Speaking Proficiency: A Study of
Agreement Among Raters
    Marianne L. Adams

Independent Rating in Oral Proficiency Interviews
    John Quinones

Third Rating of FSI Interviews
    Pardee Lowe, Jr.

Determining the Effect of Uncontrolled Sources of Error in a
Direct Test of Oral Proficiency and the Capability of the
Procedure to Detect Improvement Following Classroom Instruction
    Karen A. Mullen

Reliability and Validity of Language Aspects Contributing to Oral
Proficiency of Prospective Teachers of German
    Ray T. Clifford

Interview Testing Research at Educational Testing Service
    John L. D. Clark

Psychophysical Scaling of the Language Proficiency Interview:
A Preliminary Report
    Robert J. Vincent

Setting Standards of Speaking Proficiency
    Samuel A. Livingston



PREFACE 



The eighteen papers presented in this volume form the written record 
of a conference on "Direct Testing of Speaking Proficiency: Theory and 
Application" held at Georgetown University on March 14-15, 1978. It was 
conducted by Educational Testing Service with the cooperation of the U.S. 
Interagency Language Round Table and the Georgetown University Round Table 
on Languages and Linguistics. Financial assistance for the conference and 
for publication of the proceedings was provided by the U.S. Office of 
Education under the authority of Title VI, Section 602, of the National 
Defense Education Act. 

In the approximately twenty years since the initial development, 
by the Foreign Service Institute (FSI), U.S. Department of State,
of the face-to-face language proficiency interviewing procedure and
associated rating scale, use of this or related approaches to speaking 
proficiency measurement has become increasingly widespread, both within 
and outside the federal government. A partial list of current users of 
interview-based testing techniques includes, in addition to the FSI, 
ACTION/Peace Corps, Bank of Canada, Center for Applied Linguistics,
Central Intelligence Agency, Chula Vista (Calif.) School District, Cornell 
University, Defense Language Institute, Educational Testing Service, 
Florida International University, Illinois Bilingual Service Center,
Language Training Mission of Brigham Young University, Massachusetts 
Department of Education, National Security Agency, New Brunswick (Canada) 
Education Department, and New Jersey Department of Education. 

In view of the increasing interest in and utilization of language 
testing techniques of the FSI type over the past several years, it was 
considered of possible value to bring together — through the medium of a 
formal conference directed exclusively to interview-based assessment 
techniques or other face-to-face testing procedures — major users of these 
techniques and other interested participants, both to review and discuss 
matters of common interest in direct speaking proficiency testing and to
serve as a forum for the broader dissemination of information in this
measurement area. 

The conference presentations, reproduced here in their final printed 
form, deal with one or more of the five major topical areas: (1) prac- 
tical applications of direct proficiency testing; (2) testing procedures, 
including performance rating scales and scoring techniques; (3) training 
and quality control of testers and raters; (4) validity and reliability of 
direct testing techniques; and (5) current and proposed research and
development activities in direct proficiency testing. 

The opening paper, by Howard E. Sollenberger--former director of
the Foreign Service Institute, who was, as he puts it, "present at the
creation" of the FSI interview--details the development of the inter-
viewing and rating procedure and its past and current use by U.S.
government agencies and discusses the scope of proper utilization of this
technique. Appendix A of his paper reproduces the Absolute Language
Proficiency Ratings that constitute the official rating scale for the FSI
interview and that may be referred to as needed in the reading of other
conference papers.

The next five papers provide a rather broad overview of the opera-
tional use of the FSI technique or adaptations of the technique in a
wide variety of measurement applications in both government and private 
contexts. William Lovelace describes the use of FSI-type interviews to 
evaluate the host-country language proficiency of Peace Corps volunteers 
and discusses some of the special considerations involved in the use of 
English-medium training procedures to train interviewers and raters for 
testing in non-European languages. 

Murielle Albert describes a province-wide system of interview-based 
language testing at the secondary school level for the New Brunswick
(Canada) Education Department, and emphasizes both programmatic and 
individual-student benefits of a direct proficiency measurement approach. 

The paper by Steven L. Graham gives an overview of the large-scale,
intensive language program conducted at the Language Training Mission
(Provo, Utah) and the procedures used by the LTM to initially train and 
subsequently monitor the performance of interview testers/raters; this is 
followed by a discussion of diagnostic checklists and other procedures 
used to provide feedback to individual examinees. 

Robert B. Franco of the Defense Language Institute, Monterey, 
describes the recent (1976) revision of the DLI language assessment 
system, which emphasizes the use of criterion-referenced interviewing and 
role-playing situations to determine students' functional command of the 
spoken language. 

The paper by Richard W. Brown summarizes the bilingual and
English-second-language teacher certification requirements recently
adopted by the state of New Jersey and describes the interview-based
testing program through which the speaking proficiency of teachers and 
teacher candidates is measured for certification purposes. 

The next four papers address a number of different aspects of the 
interviewing process and suggest certain changes in testing techniques, 
scoring procedures, or utilization of results, both to guard against 
possibly inappropriate applications of this measurement technique and to 
enhance the measurement value of the interview approach for situations in 
which its use can be recommended. The paper by Claus Reschke proposes 
an expansion of the interview rating scale to provide more detailed 
information on examinee performance, especially for use at the secondary 
school and early college levels, where the total range of performance is
typically restricted to the lower (0 - 2+) portion of the total FSI scale. 

At the other end of the proficiency spectrum, Randall L. Jones 
addresses the challenge of testing examinees at the higher (3+ - 5) score 
levels, and describes his experimentation with a variety of supplementary
techniques, including low-frequency vocabulary testing, sentence repeti-
tion, and specified situational cues, to measure the sophisticated kinds
of language behavior at issue in the upper regions of the FSI scale.

Ingrid F. Roos-Wijgh, of the Dutch National Institute for Educa- 
tional Measurement (CITO), describes a test development project based 
on role-playing techniques that engage examinees in realistic dialogue
situations for specified communicative purposes. This testing approach--
although historically and operationally distinct from the FSI interview as
it has developed in the United States--is of considerable relevance to the
examiner/examinee "situation" that is often included as the final step in
the interview process. 

In his detailed and wide-ranging paper, Robert Lado undertakes an 
analysis of the nature and psychometric characteristics of interview-based 
testing procedures in comparison with alternative or supplementary
approaches, including the use of "objective" tests to assess the listening
comprehension aspects of an examinee's performance and discrete-item tests
of grammar, vocabulary, and pronunciation when diagnostic information on
these language aspects is desired--rather than or in addition to the more
global appraisal of proficiency provided by the face-to-face interview. 

The third series of papers, comprising the conference presentations 
of six authors, addresses in some detail the basic psychometric charac-
teristics of the FSI-type interview (or adapted versions of the interview)
as they are manifested in operational use of the interview technique in a
variety of measurement contexts.

Marianne L. Adams presents the results of a detailed study of the 
interrater reliability of the interview process as carried out by French,
German, and Spanish interviewers/raters at the FSI and cites a very high
degree of scoring consistency for raters in these three language groups.

John Quinones describes an adaptation of the interview scoring 
process that involves use by the raters of a graphic scoring scale that is
seen to permit more fine-grained discrimination of examinee performance
than is possible under the regular (categorical) rating system and to
facilitate the combining and analysis of ratings assigned by two or more 
raters to a single examinee. 

The paper by Pardee Lowe, Jr., summarizes a recent study in which 
"third raters" of proficiency interviews (i.e., any evaluators of a 
given interview other than those present at the original interview) were 
found--contrary to expectations--to be generally no more severe in their
ratings than the original rating team, supporting the validity of "third
ratings" as conceptually and operationally similar to those given during
initial scoring. 

Karen A. Mullen reports high interrater correlations for an 
FSI-type test using a modified rating procedure ("poor," "fair," "good,"
"above average," and "excellent" for each of the language aspects of
listening, pronunciation, fluency, and grammar), and compares pre- and
post-instruction interview scores for a group of undergraduate ESL
students to similar scores on the Test of English as a Foreign Language 
(TOEFL). Results of this comparison are analyzed in terms of the nature 
and measurement purposes of the two types of instruments. 

Ray T. Clifford describes the development of a modified interview 
rating scale synthesizing the FSI verbal descriptions with five other 
rating scales, and subsequently used in conjunction with a "Teacher
Oral Proficiency" interview that was experimentally compared to a tape
recording- and booklet-mediated speaking test (the MLA Cooperative Foreign
Language Proficiency Test) with a group of prospective German teachers at
the University of Minnesota. Results of this study provide comparative
information on the interrater, intrarater, and test-retest reliabilities
of the direct testing procedure vis-à-vis the more highly structured MLA
test, as well as initial data on the convergent and discriminant validity
of both testing procedures as applied to the diagnostic assessment of
discrete aspects of language performance (grammatical control, vocabulary,
pronunciation, and fluency).

The editor reports on several interview-based testing studies 
conducted at Educational Testing Service and discusses study results from 
the viewpoints of prediction of rater competence based on performance 
during rating training; scoring reliability of trained interviewers; 
relationship of interview scores to other measures of language competence;
and duration of interview as related to the practicality, validity, and
reliability of the interview process. 

The two final papers address the use and interpretation of interview-
based test results. Robert J. Vincent presents the results of a study in 
which experienced language teachers were asked to estimate the relative 
difficulty of training a beginning language student from "zero" to any 
given level on the FSI scale, or between any two pairs of levels on the 
scale. Perceived difficulty data of the type presented, together with 
empirically derived measures of language learning difficulty (such 
as total contact hours required to reach various FSI levels), is of 
considerable interest from a psycholinguistic standpoint and is also of 
practical value in promoting a more accurate and realistic conception on 
the part of language teachers and administrators regarding the difficulty 
and amount of training required to reach specified levels of language 
competence. 

Samuel A. Livingston describes the operation and results of an 
empirically based study conducted in collaboration with the New Jersey 
Department of Education to assist the Department in the setting of
"passing" standards for bilingual and ESL teacher candidates on the
FSI-type interview used as part of the certification process in the 
state. In addition to presenting the results of the New Jersey study, the 
author discusses the standard-setting procedure on a more general basis 
and urges the use of this or a similar technique in any other important
"decision-making" contexts involving the use of interview test results. 






Numerous individuals and several different organizations contributed 
in a variety of ways to the initial planning and conduct of the conference 
and to the compilation of the conference proceedings. 

I would first like to thank my friend and colleague of long standing, 
Mr. Protase Woodford — associate director of the International Office at 
Educational Testing Service and project director for the conference — for 
his initial perception of the appropriateness and usefulness of convening 
current users of the FSI interview technique and other face-to-face
speaking proficiency measures to describe their own testing activities and 
to share information, insights, and mutual concerns with others involved 
in or interested in the potential applications of these measurement 
approaches. His continued interest and support at all stages of the 
conference are much appreciated and gratefully acknowledged here.

Both Mr. Woodford and I are in turn indebted to each of the other 
individuals and groups who helped make the conference a reality, most 
notably Mrs. Julia A. Petrov--Chief of the Research Program, Inter-
national Studies Branch of the Division of International Education,
DHEW/USOE and project officer for the conference--who, from the very
beginning of discussions with her office and throughout the project
period, fully supported the underlying rationale and purposes of the
conference and provided valuable suggestions on its overall content,
structure, and implementation. 

The conference also benefited greatly in the early planning stages
from correspondence and discussions with Dr. James R. Frith, dean of the
School of Language Studies at the Foreign Service Institute and chairman
of the Management Committee of the Interagency Language Round Table, and
with Dr. Dorothy E. Waugh, chairman of the Testing Committee of the
Interagency Language Round Table, and the other members of the Testing
Committee. All of these contacts were of substantial value in identifying
and seeking the representation at the conference of both government and
nongovernment agencies known to be using the FSI interview technique or
adaptations of it, and in identifying specific topics and potential
presenters for the conference. 

Dr. James E. Alatis, dean of the School of Languages and Linguistics 
at Georgetown University and chairman of the 1978 Georgetown University 
Round Table on Languages and Linguistics, lent his full support to the 
purposes of the conference and graciously arranged for the conference to 
be included as a presession component of the 1978 Georgetown University 
Round Table. He and his associate, Mrs. Carolyn Adger, made available 
highly suitable meeting facilities on the Georgetown campus and extended 
every personal and professional courtesy in the course of the conference 
sessions. 

Valuable assistance in coordinating conference arrangements in the 
Washington area and in providing on-site administrative support during the 
two conference days was provided, respectively, by Dr. Tracy Gray and Ms.
An 1 Convery of the Center for Applied Linguistics. 






Staff members at Educational Testing Service who made substantial
contributions to the work of the conference or the preparation of the 
proceedings include my secretary, Mrs. Dolores Robinson, who was of 
inestimable assistance at all stages of the project; Mrs. Nancy Parr, who 
provided excellent editorial and proofreading support; and Vydec operators 
Mrs. Maryann Cochran and Mrs. Brenda Mahan, whose admirable diligence and 
indefatigability provided the camera-ready text of the proceedings.

A final acknowledgment and most heartfelt appreciation are expressed 
to all of the conference presenters, whose contributions are reproduced 
in this volume. If a slight semantic liberty can be permitted me, I
would like to close these introductory paragraphs by stating that these
individuals were the March 14-15 conference and are the present proceed-
ings, which it has been my great pleasure and honor to assemble here.

J.L.D.C. 






DEVELOPMENT AND CURRENT USE OF THE 
FSI ORAL INTERVIEW TEST 



Howard E. Sollenberger
Director, Foreign Service Institute (retired) 






DEVELOPMENT AND CURRENT USE OF THE FSI ORAL INTERVIEW TEST 



Howard E. Sollenberger 

I address you today, not as a specialist in foreign language
testing or as a linguist, but rather as an administrative philosopher and
historian. Since I no longer administer, I can perhaps be permitted
to give you some history of the development of the foreign language
oral interview tests of the Foreign Service Institute (FSI) and to
philosophize on the subject of this conference, "Direct Testing of 
Speaking Proficiency: Theory and Application." 

I hope I am not presumptuous in assuming that a brief historical case
study of the circumstances under which direct interview testing was first
attempted on any significant scale, and how it developed into a system
used throughout the federal government, would be helpful as background for
our deliberations. Certainly we will want to examine both the advantages,
and the implications, of putting theory into practice, in institution-
alizing systems by which we attempt to measure and differentiate human
performance. 

To paraphrase Dean Acheson, you might say that I was "present at the 
creation" or, perhaps more accurately, at the incubation of the oral
interview testing system developed at the FSI. While it may now be rather
dim in our memories, we were in a period of "cold war" intensification in
the early 1950s. It had wide and significant ramifications in our public
life, and even in education. By the late 1950s it would, among other
things, generate the National Defense Education Act, which was to support
the upgrading of science, mathematics, and foreign area and language
studies in American education. Meanwhile, with the impetus of the Korean
War and the experience of having been unprepared for the global war a
decade earlier, the Civil Service Commission in 1952 was directed, under
the National Mobilization and Manpower Act, to inventory and develop
a register of persons in government who had skills, background, and
experience in various foreign areas and languages. 

Following normal bureaucratic procedures, the Civil Service Commis-
sion created an interagency committee to study the problem and recommend
procedures. At early meetings it became apparent that, if an inventory
were to serve any useful purpose, some means of defining and differen-
tiating levels of foreign language proficiency and area expertise would be
necessary. The old labels of fair/good/fluent/bilingual were obviously
inadequate.

Dr. Henry Lee Smith (then dean of the FSI Language School), the State
Department's representative on the interagency committee, pressed for a
system and the development of criteria that would differentiate testable
levels between "no knowledge" of a given foreign language and "total
mastery." He was promptly named to head a subcommittee to prepare
definitions and so-called working papers. As Dr. Smith's alternate on the
committee, I became involved as a coconspirator in trying to get the
federal government to realistically face personnel deficiencies in area
expertise and foreign language skills.






As it developed, there was not only difference of opinion, but also 
opposition to the concept. There was concern in certain agencies that 
through the proposed survey and the establishment of a national register, 
the Civil Service Commission would further interfere in the personal 
fiefdoms of the various agencies. There was also fear that testing based 
on new absolute standards would prove embarrassing to many employees who
had claimed "fluency" in a foreign language on their applications for
employment. To make a long story short, a compromise was reached that
provided for each agency to conduct its own survey using definitions and 
criteria established by the committee. Testing would be optional. 

There were five different factors considered in defining and differ-
entiating levels of area expertise: systematic area training (A), basic
social science training (S), professional experience in an area (PA),
professional experience related to an area (PE), and residence in an
area (AR). Three to five differentiated levels were defined under each
factor.

Under the language proficiency section, symbolized by the letter L,
six differentiated levels were defined. To avoid complicating the task,
no effort was made to separate the components of language proficiency,
which were generally considered to be comprehension of oral production,
speaking proficiency, reading proficiency and comprehension, and writing.
At the base of the scale, L-1 was defined as "no proficiency in either
reading or speaking a foreign language."

The upper end of the scale, L-6, was defined as "sufficient pro-
ficiency in speaking, reading and writing to negotiate oral and written
agreements and to thoroughly understand the press, popular and classical
literature and official documents." It was noted that "this category is
reserved for bilingual or native speakers of the language."

It was proposed that category L-4 be considered as the minimum
proficiency level for inventory purposes. This was defined as "sufficient
proficiency in speaking a language to conduct ordinary routine business
conversations and to read general non-technical material." It was noted
that "this level of proficiency might normally be acquired by 9 to 12
months of intensive language training or the equivalent in part-time 
study, depending on the difficulty of the language." 


Bureaucratic foot-dragging, a change in the administration, and 
winding down of the Korean War resulted in the whole project being
shelved. 

However, at the FSI, enough interest had been generated in the
potential usefulness of this approach to stimulate further refinement of
the scale and to experiment with structured oral interview testing of 
students. 

The second impetus came in 1953, when Loy Henderson, then Deputy 
Undersecretary of State, decided to conduct a survey of foreign language
skills in the Foreign Service. Up to that time there had never been an 
inventory of language skills in the Foreign Service. Mr. Henderson was 
motivated by a conviction that post-war diplomacy would increasingly 
require face-to-face communication with people around the world as well 
as between government representatives and diplomats. In spite of some 
opposition within the Foreign Service, Mr. Henderson insisted that the 
survey be followed by testing. He also intended to tie promotions to 
tested foreign language proficiency. This was serious business in the
highly competitive Foreign Service. It was also serious business for the 
FSI and those who would design and conduct the tests. 

Testing of the 1952 definitions of L-1 through L-6 on some 200
officers showed them to be inadequate for the purpose of a self-appraisal
survey of the Foreign Service. It became apparent that speaking and 
reading proficiencies would have to be separately determined. From this 
emerged the L and R scales, with the speaking (oral production) scale (L) 
differentiated from 1 to 6, and reading facility (R) differentiated from 
1 to 5. 

With this instrument a self-appraisal survey was conducted in the
Foreign Service. It revealed that less than half of the 4,041 regular,
reserve, and staff officers surveyed had a "useful to the service"
proficiency in French, German, or Spanish. (These three languages, along
with English, were considered the "world languages" of diplomacy.)
"Useful" was then defined as "sufficient control of the structure of a
language, and adequate vocabulary, to handle routine representation
requirements and professional discussions within one or more special
fields, and--with the exception of such languages as Chinese, Japanese,
Arabic, etc.--the ability to read non-technical news or technical writing
in a special field." This was the L-^, R-3 level as defined in the
self-appraisal scales.

These findings led to a new language policy, announced by the
Secretary of State on November 2, 1956. This policy was based on the
premise that foreign language skills are vital in the conduct of foreign
affairs. Therefore, "each officer (would) be encouraged to acquire a
'useful' knowledge of two (2) foreign languages, as well as sufficient
command of the language of each post of assignment to be able to use
greetings, ordinary social expressions and numbers; to ask simple
questions and give simple directions; and to recognize proper names,
street signs and office and shop designations." It further stated:
"Evidence of achievement will be verified by tests administered by the
Foreign Service Institute."

Having been committed to testing, FSI was under pressure to develop 
reliable test procedures. As Claudia P. Wilds pointed out in her paper
"The Oral Interview Test," published in 1975 by the Center for Applied
Linguistics in Testing Language Proficiency: "Both the scope and the
restrictions of the testing situation provided problems and requirements" 
previously unknown in language testing. 



In the course of developing and refining oral interview test
procedures, Professor John B. Carroll, then of Harvard, was consulted. This
led to a revision of the differentiated levels of proficiency and the
redesignation of the symbols and levels. The symbol L was changed to S
to identify the scale for speaking proficiency. R remained the symbol
for the reading scale. Each scale was differentiated into six levels,
numbered from 0 to 5.

Since this provided, for the first time, officially approved
performance and criterion-based definitions that testers, instructors, and
administrators found useful, the system rapidly became institutionalized
and the S and R symbols became part of the jargon.

Not surprisingly, problems began to emerge. Officers being tested
complained that different testing teams applied different standards,
particularly in testing different languages. For example, it was commonly
believed--and with some justification--that an S-3 rating was much
tougher to get in French than in the so-called hard or esoteric languages.
It was also rumored that students tested by their own instructors seemed
to fare better than those who simply came in for tests. Testers seemed
to be more critical in judging the performance of those whom they did not
know through a teacher-student relationship. In some cases, the rank and
age of the officers were seen to influence the rating. Informally there
developed what became known as the "compassionate" S-3 rating. There was
also evidence that some testers seemed to be unduly influenced by the
personalities and cooperativeness of persons being tested.

With mandatory testing of Foreign Service officers announced in 1957,
and with assignments and promotions to be influenced by the results, these 
problems had to be solved. An independent testing unit was established in 
July 1958, with Frank A. Rice as head of the unit and Claudia Wilds as his 
assistant. It was through the collaboration of these two people that a 
significant breakthrough came in standardizing oral testing procedures. A 
checklist was developed that contained five "factors": accent, grammar, 
vocabulary, fluency, and comprehension. Considerable work went into 
selecting these factors. The criterion was that they should be of a 
sufficiently general nature that they would apply equally well to all 
languages. Each factor was subdivided as a six-point descriptive scale, 
with "polar" terms X (extremely poor or inadequate) and Y (extremely good, 
accurate, or complete). 

As Frank Rice pointed out in an article entitled "The Foreign
Service Institute Tests Language Proficiencies" (Linguistic Reporter, May
1959): "The original purpose of the Check List was to help counterbalance
the inherent subjectivity of the testing procedure by providing agreement
about what aspects of the performance were to be observed, a control on
the attention of the observers, and a system of notation that would make
judgments of different observers more nearly comparable.

"There is no doubt that the Check List accomplished its original
purpose. This was expected. What was quite unexpected was what emerged
from statistical analysis. This provided basic evidence of a high degree
of consistency in the subjective judgments of the examiners. The
instrument could thus serve not only as a useful record, but also as a
highly accurate predictor."

It also provided a means for training testers. Claudia Wilds, who 
was appointed head of the testing unit in 1963, subsequently developed a 
weighted scoring system for the checklist. Among other things, this 
provided a means for occasional verifications of the checklist profiles 
and seemed to keep examiners in all languages reasonably in line with each 
other. 
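The idea of a weighted checklist score can be illustrated with a minimal sketch. The five factor names are those given above; the numbering of the six-point scale from 1 to 6, the function name, and above all the weights are assumptions made for illustration, since the actual FSI weights are not stated in this paper.

```python
# Factor names come from the text; everything else here is illustrative.
FACTORS = ("accent", "grammar", "vocabulary", "fluency", "comprehension")


def weighted_score(profile, weights):
    """Combine a five-factor checklist profile into one weighted total.

    profile: factor name -> point on the six-point scale (assumed 1..6)
    weights: factor name -> numeric weight (hypothetical values)
    """
    total = 0
    for factor in FACTORS:
        point = profile[factor]
        if point not in range(1, 7):
            raise ValueError(f"{factor}: expected a point from 1 to 6, got {point!r}")
        total += weights[factor] * point
    return total


# Placeholder weights: equal weighting, NOT the values used by FSI.
example_weights = {factor: 1 for factor in FACTORS}
example_profile = {
    "accent": 3,
    "grammar": 4,
    "vocabulary": 4,
    "fluency": 3,
    "comprehension": 5,
}
print(weighted_score(example_profile, example_weights))
```

A total of this kind gives a single figure against which an examiner's overall rating can occasionally be checked, which is the verification role described above.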

Further evidence of the success of this system was the sharp drop-off 
of complaints from persons being tested, and general acceptance of the 
results even for critical personnel decisions. Also, use of the rating 
scale and test results began to spread. With some modifications, the CIA
developed a similar system, and the United States Information Agency and
the Agency for International Development joined with the Department of
State in using the FSI-developed standards and testing facilities.

Even the Congress used them, demanding reports based on FSI standards
to show progress toward compliance with a legislative mandate that the
Department of State "designate every Foreign Service officer position in a
foreign country whose incumbent should have a useful knowledge of a
language or dialect common to such country [and that] each position so
designated . . . be filled only by an incumbent having such knowledge" (Sec.
578, Foreign Service Act of 1946).

With the spreading use, in the 1960s, of the proficiency rating
scale to other agencies, including the Defense Language Institute and the
Peace Corps, it became apparent that the definitions should be further
revised and standardized among agencies. Representatives of the FSI, the
CIA, the Defense Language Institute, and the Civil Service Commission
met in 1968 and developed a unified version of the definitions. These
definitions are essentially the ones used today, and are shown as Appendix
A of this paper.

Now, twenty-five years after the inception of a criterion-referenced 
rating scale, it has been incorporated into the federal personnel manual 
for use throughout the U.S. government, and it has been adopted by the 
Supreme Headquarters of the Allied Powers in Europe. Educational Testing 
Service has joined the ranks of users, and increasing interest has
been shown in academic circles--an interest that promises impact and 
contributions in the future. 

At the beginning of this paper, I stated my hope that we would
examine the limitations and implications of applying theory to practice in
the direct testing of speaking proficiency. As I have observed this in
the government, it has become apparent to me that one of the principal
limitations is the inability of this system to make meaningful judgments
about, or to measure, the most significant objective of human speech:
effective communication. By this I mean the effectiveness or lack thereof
of an individual in listening to and fully understanding what he hears
through the static of cultural differences and the peculiarity of
personality, and the ability to communicate fully with another person of a
different culture in such a way as to achieve understanding and
cooperation.

I have observed more than a few cases where I cringed at the thought
that an individual would represent the United States overseas, even
though he had been given a high S-4, R-4 language proficiency rating
by our tests. The person's so-called language proficiency, while it
may have been quite accurate in terms of technical skill, did not mean
effectiveness in communication. In some cases, it may have enabled the
person to misrepresent or foul up more effectively. This is to say that
you can be a fool in any language or that you can put your foot in your
mouth in any language. Nor does the fact of technical ability to use
a foreign language without noticeable accent or grammatical errors mean
that the person has something worth saying. I'm sure we all know people
who talk nonsense fluently.

On the other hand, I know people who butcher the language, whose 
accents are atrocious, and whose vocabularies are limited. For these 
reasons we give them low proficiency ratings. Yet, for some reason,
some of them are effective communicators. 

You may rightly say that the tests we have developed do not measure
this dimension of effective communication. Still, I know a number
of administrators and even some linguists who do not understand the
implication of this difference.

I have also observed, in the application of these testing procedures
in training situations, a tendency to train for success on the test score,
or to the standards of the test, rather than for broad effectiveness in
communication. It becomes more important to the teacher and the student
that they achieve the S-3 level, rather than that they be effective
communicators. These are not necessarily mutually exclusive objectives,
but there are times when this is forgotten.

I am not saying that these limitations, which deal with the use of
measurement devices we create, should cause us to abandon our efforts to
perfect and use such systems. It is, however, my conviction that these
and other limitations must be recognized and that we have a continuing
obligation to make these limitations known to end users. In this we
are no different from the scientist who makes a discovery that can, if
properly used, be of benefit to humankind but that can also be misused.
I hope this conference will not ignore these responsibilities.



Appendix A 
Absolute Language Proficiency Ratings¹

The rating scales described below have been developed by the Foreign
Service Institute to provide a meaningful method of characterizing the
language skills of foreign service personnel of the Department of State
and of other Government agencies. Unlike academic grades, which measure
achievement in mastering the content of a prescribed course, the S-rating
for speaking proficiency and the R-rating for reading proficiency are
based on the absolute criterion of the command of an educated native
speaker of the language.

The definition of each proficiency level has been worded so as to be 
applicable to every language; obviously the amount of time and training 
required to reach a certain level will vary widely from language to 
language, as will the specific linguistic features. Nevertheless, a
person with S-3's in both French and Chinese, for example, should have 
approximately equal linguistic competence in the two languages. 

The scales are intended to apply principally to government personnel
engaged in international affairs, especially of a diplomatic, political,
economic and cultural nature. For this reason heavy stress is laid at 
the upper levels on accuracy of structure and precision of vocabulary 
sufficient to be both acceptable and effective in dealing with the 
educated citizen of the foreign country. 

As currently used, all the ratings except the S-5 and R-5 may be
modified by a plus (+), indicating that proficiency substantially exceeds
the minimum requirements for the level involved but falls short of those
for the next higher level.
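The notation just described (a skill letter, a level from 0 to 5, and an optional plus that is not available at level 5) can be made concrete with a short sketch. The names `Rating`, `parse_rating`, and `meets` are hypothetical, as is the written form "S-2+"; the sketch only encodes the rules stated in this appendix and is not an official FSI procedure.

```python
import re
from typing import NamedTuple


class Rating(NamedTuple):
    skill: str   # "S" for speaking, "R" for reading
    level: int   # 0 through 5
    plus: bool   # True when the rating carries a plus (+)


# Assumed written form: letter, hyphen, digit, optional plus (e.g. "S-2+").
_RATING = re.compile(r"([SR])-([0-5])(\+?)")


def parse_rating(text):
    """Parse a rating string, rejecting a plus on level 5."""
    match = _RATING.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a recognizable rating: {text!r}")
    level = int(match.group(2))
    plus = match.group(3) == "+"
    if plus and level == 5:
        raise ValueError("S-5 and R-5 may not be modified by a plus")
    return Rating(match.group(1), level, plus)


def meets(actual, required):
    """True if `actual` is at or above `required` on the same scale."""
    if actual.skill != required.skill:
        raise ValueError("speaking and reading ratings are separate scales")
    # A plus ranks above the bare level but below the next higher level.
    return (actual.level, actual.plus) >= (required.level, required.plus)
```

Under these assumptions, `meets(parse_rating("S-2+"), parse_rating("S-3"))` is false: a plus exceeds the minimum for its own level but still falls short of the next one.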



¹FSI Circular, November 1968.



Definitions of Absolute Ratings

Elementary Proficiency 

S-1 Able to satisfy routine travel needs and minimum courtesy
requirements. Can ask and answer questions on topics very
familiar to him; within the scope of his very limited language
experience can understand simple questions and statements,
allowing for slowed speech, repetition or paraphrase; speaking
vocabulary inadequate to express anything but the most
elementary needs; errors in pronunciation and grammar are
frequent, but can be understood by a native speaker used to
dealing with foreigners attempting to speak his language; while
topics which are "very familiar" and elementary needs vary
considerably from individual to individual, any person at the
S-1 level should be able to order a simple meal, ask for shelter
or lodging, ask and give simple directions, make purchases, and
tell time.

R-1 Able to read some personal and place names, street signs, office
and shop designations, numbers, and isolated words and phrases.
Can recognize all the letters in the printed version of an
alphabetic system and high-frequency elements of a syllabary or
a character system.

Limited Working Proficiency 

S-2 Able to satisfy routine social demands and limited work
requirements. Can handle with confidence but not with facility
most social situations including introductions and casual
conversations about current events, as well as work, family
and autobiographical information; can handle limited work
requirements, needing help in handling any complications or
difficulties; can get the gist of most conversations on
non-technical subjects (i.e. topics which require no specialized
knowledge) and has a speaking vocabulary sufficient to express
himself simply with some circumlocutions; accent, though often
quite faulty, is intelligible; can usually handle elementary
constructions quite accurately but does not have thorough or
confident control of the grammar.

R-2 Able to read simple prose, in a form equivalent to typescript or
printing, on subjects within a familiar context. With extensive
use of a dictionary can get the general sense of routine
business letters, international news items, or articles in
technical fields within his competence.



Minimum Professional Proficiency 



S-3 Able to speak the language with sufficient structural accuracy
and vocabulary to participate effectively in most formal and
informal conversations on practical, social, and professional
topics. Can discuss particular interests and special fields of
competence with reasonable ease; comprehension is quite complete
for a normal rate of speech; vocabulary is broad enough that he
rarely has to grope for a word; accent may be obviously foreign;
control of grammar good; errors never interfere with
understanding and rarely disturb the native speaker.

R-3 Able to read standard newspaper items addressed to the general
reader, routine correspondence, reports and technical material
in his special field. Can grasp the essentials of articles
of the above types without using a dictionary; for accurate
understanding moderately frequent use of a dictionary is
required. Has occasional difficulty with unusually complex
structures and low-frequency idioms.

Full Professional Proficiency

S-4 Able to use the language fluently and accurately on all levels
normally pertinent to professional needs. Can understand
and participate in any conversation within the range of his
experience with a high degree of fluency and precision of
vocabulary; would rarely be taken for a native speaker, but
can respond appropriately even in unfamiliar situations; errors
of pronunciation and grammar quite rare; can handle informal
interpreting from and into the language.

R-4 Able to read all styles and forms of the language pertinent to
professional needs. With occasional use of a dictionary can
read moderately difficult prose readily in any area directed to
the general reader, and all material in his special field
including official and professional documents and
correspondence; can read reasonably legible handwriting without
difficulty.



Native or Bilingual Proficiency 

S-5 Speaking proficiency equivalent to that of an educated native
speaker. Has complete fluency in the language such that his
speech on all levels is fully accepted by educated native
speakers in all of its features, including breadth of vocabulary
and idiom, colloquialisms, and pertinent cultural references.

R-5 Reading proficiency equivalent to that of an educated native.
Can read extremely difficult and abstract prose, as well as
highly colloquial writings and the classic literary forms of the
language. With varying degrees of difficulty can read all
normal kinds of handwritten documents.



INTERVIEW TESTING IN NON-EUROPEAN LANGUAGES

William Lovelace
ACTION/Peace Corps

One of the most important aspects of overseas service as a Peace 
Corps volunteer is the ability to speak a foreign language or languages. 
Indeed, two of the three Peace Corps goals relate to an improved under- 
standing between Americans and peoples of the world. Training and 
evaluating our volunteers in these languages has been a unique challenge 
to the agency, given the large number of languages volunteers are asked to 
learn (at least twenty in Africa alone) and the fact that these languages 
are often little-known and rarely studied. 

A further complication, particularly in Africa, is that the
volunteers must be trained and tested in both the official (European)
language and the local language. The European language is almost always
a Romance language. I say almost always since English is the official
language of nine African countries as well as Belize, the Eastern
Caribbean, Jamaica, and several areas of the Pacific. Even in these
Anglophone countries, however, English is not always the language most
appropriate for village-level communication, and proficiency in a local
language becomes necessary if the volunteer is to be effective.

Evaluating the proficiency of our volunteers in the various languages
of the world is a challenging assignment, and analyzing the language
levels in Anglophone countries has proven to be particularly difficult.
The history of our language evaluations has, to some degree, resulted from
our training formats.

During the early years of the Peace Corps, the majority of the
training programs took place at university campuses. This classroom
instruction was compatible with the FSI interview format, and we used FSI
testers to interview volunteers in French, Spanish, and Portuguese. As we
shifted the training emphasis to in-country, we had an increased need to
test in the many national languages our volunteers learn. This meant
that we could no longer use imported FSI testers; we needed to rely on
host-country testers to interview volunteers in these languages. Our
initial contract with Educational Testing Service (ETS), therefore, called
for not only interviews of language students but also certification of
testers. However, the certification of testers in "exotic" languages
that the certifiers did not speak became a definite complication. For
those countries in Latin America and Africa where Romance languages are
spoken, these languages were used as certification vehicles for the local
languages. These Romance languages, however, are not appropriate to Asia
and to Anglophone countries in the rest of the world. We therefore had
recourse to certification through English for these situations.

The use of a European language as a test medium raises questions,
some of which I will discuss. No matter what theoretical or philosophical
constraints we may face in this testing procedure, we feel we must
evaluate all our volunteers. This is in part due to fiscal responsibility.
We spend a large part of our training budget on language, and we are held
responsible for tracking the results of these expenditures. This money is
spent to train in many world languages. At one time we were not equipped
to train and test in the non-Romance languages. We realized, however,
that testing and training are absolutely essential for all volunteers if
we are to honor the commitment contained in the Peace Corps goals I
mentioned earlier. In surveys taken of the volunteers, we are reminded of
this need.

The annual survey of volunteers has recently been published, and it 
contains data that are specifically relevant to our language training. 
The study shows a strong correlation between job satisfaction/ 
psychological well-being and an ability to speak the local language. 
The survey also shows a direct connection between satisfied volunteers 
and training programs incorporating home stays with host-country families 
(with a high priority on local language). Further, the survey shows 
that 55 percent of the respondents throughout the Peace Corps use a 
non-English host-country language at least half the time in their work. 
Also, as a group, the volunteers who are least satisfied with their
language training serve in Anglophone countries. It is therefore in these 
countries that we perhaps have the most to accomplish in training and 
evaluation. 

But it is also in these countries that we face the challenge of
certification of testers in English. In our original agreement with ETS, 
if someone were certified in French, Spanish, or Portuguese, that person 
was also certified to test in one or more local languages. We decided to 
maintain this practice and to use a certification kit of listening tapes 
and ETS visits to certify testers in the Anglophone countries. Some of 
the following points of discussion relate to our certifying testers in 
European languages, but there are some ideas specific to certification in 
English that I wish to stress. 

There is no doubt that the certification by ETS of host-country
testers adds an element of "status" and a sort of professional recognition
to those people working for the Peace Corps overseas. It must be admitted
that Peace Corps employment is not always seen as representing any sort of
professional standing, and our working relationship with an institution
such as ETS lends credibility to our language program. In Africa, without
certification through English, we would be unable to have this recognition
in non-Francophone countries. The use of ETS certification helps assure
that we have a standardized and widely recognized "shorthand" for language
testing throughout Africa and across linguistic lines. This in turn
enables the volunteers in Anglophone countries to enjoy the same advantage
as Francophone volunteers: a record of their language proficiency can be
kept on file at ETS in Princeton. Admittedly, an official 2+ in Krio or
siSwati may have less "clout" and be less valuable for graduate credit
than a similar score in French or Spanish, but this record can represent a
tangible acquisition after two years of volunteer service.

Testing volunteers in the host language also adds a professional note
to our in-country language programs. The volunteers are more likely to
apply themselves in their language studies if they know they are being
"rated." There is often a spirit of competition and pride in the language
programs that would not exist without a record of progress. This
situation is true for all volunteers, who must be able to deal with local
and village situations, but it is particularly helpful to volunteers
serving in countries where one can coast or "get by" in English.

The use of English for certification of testers has caused some
concern among those of us in the Peace Corps working in language programs.
Perhaps the most obvious issue is that this process requires that the
candidate have a rather sophisticated level of English; he or she must be
able to successfully rate the test tapes. English is widely spoken in
many overseas countries and, as I mentioned, is an official language in
large parts of the Peace Corps. However, limiting the group of possible
testers to those demonstrating an ability to analyze English does place
a severe constraint on the pool of applicants. There is also the fact
that in certifying someone to test in English (or a Romance language)
we have no guarantee that this demonstrated ability to analyze French
or English can be transferred to the candidate's non-European native
language. We must use this inferred ability to shift analytical skills as
a base in our use of ETS certification since, with the exception of Latin
America and parts of the Pacific, our tester-candidates are not native
speakers of a European language. It is unrealistic to develop tester
certification in the many languages volunteers work in, including such
national languages as Thai, Farsi, and kiSwahili.

In Africa, the use of English for certification also spotlights
the fact that Americans are certifying Africans in a language that
differs somewhat in the various parts of the world. The English spoken
throughout Africa can certainly be evaluated against standard norms of
"correct" English, but there is a wide range of accents and vocabulary
among Africans who live thousands of miles apart. The use of English also
brings out the issue that we are certifying testers in a language they
will never be asked to test in. We will probably never request a
host-country tester to evaluate a volunteer's English level.

A further assumption we have made in certifying in English is that
the person who "passes" the English certification is able to go through
the same thought process in his or her native language. In Anglophone
Africa we are often dealing with an indigenous language that the native
speaker has not studied as an academic subject, a language that may be
neither written nor read.

The nature of many of the African languages has raised the concern
that these somewhat exotic languages do not necessarily lend themselves
to an FSI-type interview analysis. Some of these languages are little
known or studied other than in linguistic or perhaps missionary circles,
and there is probably little information available as to the structure,
patterns, and elements that would constitute a 2+ in Mende. Our
experience shows that the tester is usually so taken by the volunteer's
ability (and desire) to speak a language not often studied by outsiders
that the ratings depend almost entirely on fluency, nonverbal social cues
inherent in the language, and the use of proverbs or vignettes that
reflect the history or philosophy of the culture.



There is, finally, the concern that an FSI interview in a local
language is not related to the everyday use of the language by the
volunteer. These languages are normally used in job-specific settings.
They would not be used in high-level or official contacts and would
rarely have the kind of direct question/answer format of the traditional
interview.

Having outlined reasons why we feel we must evaluate our volunteers'
proficiency in foreign and sometimes exotic languages, and having
discussed some of the questions raised by certifying host-country testers
in European languages (and especially in English), I unfortunately have
little to say about what we are doing to change things. I believe we
should give more thought to situational testing, which would be more
closely related to a volunteer's use of the language. To do this, we
would have to change our test format. We would also have to develop
criteria for rating someone's ability to perform a set exercise in the
foreign language and, if possible, equate that performance to a scale that
would have outside recognition, such as an FSI level. This is a challenge
facing the Peace Corps, one which the new administration of the agency may
choose to face in the near future.



MEASURING SECOND LANGUAGE SPEAKING ABILITY IN
NEW BRUNSWICK'S SENIOR HIGH SCHOOLS

Murielle Albert
Education Department
(New Brunswick, Canada)

Introduction 

English is not the sole language spoken in the province of New
Brunswick, Canada. About 34 percent of New Brunswick's population, which
is now close to 700,000, are French-speaking. So, for many of these New
Brunswickers, English is a second language--that is, a language necessary
for certain official, social, commercial, or educational activities within
their own province and country.

On July 1, 1977, New Brunswick officially became a bilingual pro- 
vince. Therefore, English is a top requirement of those seeking good jobs 
within the province and is the language in which most of the business 
affairs of the more prestigious and more highly paid jobs are conducted in 
other provinces of Canada. 

Background 

English as a second language has always been taught in New Brunswick's 
schools. Students generally have the opportunity of learning English for 
a minimum of six years to a maximum of nine years before they leave high
school. 

Unfortunately, until six or seven years ago, students leaving high
school with six to nine years of English could hardly communicate in the 
target language among themselves and even less with English-speaking 
people. Too much stress had been placed on the reading and writing skills 
and not enough on the listening and speaking skills. As a result, the 
Department of Education decided to introduce new programs in New Brunswick 
schools stressing oral proficiency, as summarized in Appendix A. (New 
programs were also introduced for French as a second language.) 

I was teaching at the high school level at the time and was asked by 
my superintendent to pilot one of these new courses taught by the aural- 
oral approach. Having accepted, I spent a few summers studying this new 
approach and became what we call a language model. 

The audiolingual objectives were to teach the student to comprehend
the language when spoken at normal speed; to speak with "near-native
pronunciation and intonation"; to read and write "with minimal recourse to
bilingual dictionaries"; and to "understand" the people, their culture,
and their heritage.

Truly, we, the foreign language teachers, had come a long way. Once
absorbed in what we were going to do in the classroom, we were now more
interested in what we could make possible for students to do there to
develop their communicative competence as well as an awareness of cultural
and ethnic differences. We were therefore charged to provide learning
activities in which students used the language and to assure that students
were having the best language experience possible, commensurate with their
abilities, interests, and age levels.

The Department and the teachers were very excited about the new 
program, which proved to be the answer to their idea of learning a second 
language. To stress the importance of oral competency in the minds of the 
students and teachers alike, the Department decided to evaluate the spoken 
English as a second language (EASL) or French as a second language (FASL) 
of New Brunswick's high school population. Previously, the evaluation of 
English as a second language had been a written evaluation that basically 
tested the reading and writing aspects of the language. The listening and
speaking skills had never been evaluated as such. 

How was this to be done? No oral testing program existed in any 
of the other Canadian provinces. So, the only program to be tried was 
the interview procedure developed by the Foreign Service Institute and 
administered by Educational Testing Service for the Peace Corps and other 
programs. 

Training

The purpose of training New Brunswick second language teachers to do
the interviewing was to ensure that New Brunswick teachers would have as
much involvement with the program as possible and, perhaps most important,
to contribute, through the training and practice, to their professional
development as second language teachers. It was assumed that
teachers who had such a close involvement with the program would be
supportive of the program and that maximum cooperation would result.

To train teachers as classified interviewers for the province,
practice tapes as well as testing tapes had to be made available. The
voices on the tapes had to be those of our students, interviewed by
classified interviewers from ETS. And that is how I came to have the
pleasure of meeting and working with Russ Webster and Woody Woodford.

Russ and Woody came to my school, in Caraquet, in the spring of 1974 
to interview and tape sixty students. I don't know who enjoyed those 
sessions more, the interviewees or the interviewers. The students would 
come out of the interviews beaming with excitement. Most of them would 
rush to me and tell me how friendly the two interviewers were--how they 
had made them laugh and actually forget they were speaking English. Even 
the 0 level student felt very much at ease and thought he had performed 
well. The experience proved to be very successful and I, personally, 
was excited about the whole program. 

To date, there have been four training sessions. The first two 
were part of the initial contract with Educational Testing Service; the 
others were added in 1977 and 1978 due to the increased demand for the 
interviews. 






A summary of the results of these training sessions follows. 



Session No. 1 

(According to the contract with ETS, this session would train twenty 
New Brunswick teachers to administer interviews and in turn to train 
other second language teachers in the province.) 

No. Enrolled      No. Qualified            No. of Trainers 
FASL    EASL      FASL        EASL         FASL    EASL 
10      10        2+4 (6)     3+7 (10)     4       7 

Session No. 2 

No. Enrolled      No. Qualified            No. of Trainers 
FASL    EASL      FASL        EASL         FASL    EASL 
19      27        3+7 (10)    3+11 (14)    7       11 

Session No. 3 

No. Enrolled      No. Qualified 
FASL    EASL      FASL    EASL 
28      17        9       11 

Session No. 4 

No. Enrolled      No. Qualified 
FASL    EASL      FASL    EASL 
10      16        10      15 



In addition, ten individuals who did not qualify at the time of the 
training resubmitted the test tapes to ETS. 



No. Resubmitting Tapes 
FASL    EASL 

No. Qualified 
FASL    EASL 

8 

1 



We now have forty qualified interviewers for French as a second 
language and fifty qualified interviewers for English as a second 
language. Included in these totals are the eleven French-as-a-second- 
language trainers and eighteen English-as-a-second-language trainers. I 
must add here that these teachers were all invited to participate. It 
wasn't thrown open to all second language teachers. 

The training sessions lasted two to three days. During that period 
of time the teachers, guided by resource people from ETS, discussed the 
technical and linguistic aspects of the language proficiency interview, 
the assignment of interview ratings, discussion of student performance, and 
the numerical rating procedure. Then the recordings of the practice 
interviews were played and the teachers scored them to the best of their 
ability. This was followed by a discussion of the scoring of the 
above-mentioned interviews as they are described in the manual. 

The next step was the formation of groups of about six to eight 
teachers for the live interviews. Enough pupils were brought in so every 
teacher had the opportunity to interview one pupil. While the interviews 
were being recorded the observing teachers and the trainers scored the 
performance. After an interview, the raters discussed both the interview 
techniques and the scoring. By the time the live interviews were over, 
most groups were able to reach basic agreement on methods and standards, 
thereby ensuring a reasonable degree of uniformity. 

The final step in the training session was scoring the test tapes. 
Each teacher was given ten tapes to score independently, with the help of 
the manual. These test tapes were sent to ETS to be evaluated. Whether a 
person qualified depended primarily on one's success with the test tapes 
and one's ability to interview effectively during the live performance. 

As the teachers weren't too sure what the workshops were all about, 
many were apprehensive and didn't perform to the best of their abilities. 
(To be frank, it isn't a normal situation.) Moreover, they had to perform 
in the target language, and many teachers felt that their spoken English 
was a bit rusty. As one teacher remarked, "If I had been tested beforehand 
and knew my level of proficiency, I'd have more confidence." Many 
told me afterwards that the only English they spoke was in the classroom 
so they lacked vocabulary when it came to testing higher-level students. 

I had the opportunity to meet those teachers within the next year, 
and most of them told me how valuable the experience had been for them. 
As they were classified interviewers, they had a good idea what the spoken 
proficiency of their students was, and their goal was to raise the level 
of proficiency. Many of these teachers succeeded in organizing some sort 
of oral testing program in their schools. Others couldn't organize any 
because they felt it was too time-consuming, especially in the larger high 
schools. But as the interviewers discussed the program with the other 
teachers, an awareness was born and oral production became the primary 
skill to be stressed in our schools. 






After the first oral interview evaluation, in the spring of 1976, 
teachers realized that in the time that had elapsed between training and 
the first day of interviewing, many of the skills had become nebulous. 
For instance, many teachers would spend the entire day interviewing and 
then most of the night relistening to the tapes to be sure of their 
evaluation. It was suggested that time be allotted to recalibrate the 
interviewers between the training sessions and the actual interviews. 

One such recalibration session was held in January of this year. 
Once again the teachers were invited to participate, and of the fifty 
qualified interviewers twenty-nine participated. (Some were ill; others 
were snowbound.) So many teachers responding so well to the call could 
only mean that they were really concerned and felt the need to be 
recalibrated. (I think I should mention here that some teachers had to drive 
close to 400 miles round trip.) 

The recalibration session was similar to the training session. It 
was a two-day period designed to permit the interviewers to fully review 
the interview techniques. Once again the teachers, guided by resource 
people from ETS and local trainers, reviewed the technical and linguistic 
aspects of the language proficiency interview, did live interviews, and 
scored new test tapes to be evaluated by ETS. The session was also 
profitable as it was the first time all the qualified teachers were 
working together and the exchange of ideas was invaluable. At the end of 
the two days, the teachers felt better prepared to begin the spring 
testing program. 

Scheduling 

The oral interview evaluation is scheduled for the spring of each 
year, from March to May, inclusive. The high schools are invited 
to participate; it is not compulsory. So far, we've had two testing 
sessions. In the spring of 1976, out of the 68 high schools in the 
province, 7 did not request service and interviews were completed in 
approximately 51. A total of 2,466 students were tested: 1,386 EASL 
and 1,080 FASL. In the spring of 1977, of the 68 high schools, 7 did not 
request service and interviews were completed in approximately 50; 3,417 
students were tested (1,927 EASL and 1,490 FASL). We foresee a few more 
schools for this spring. 
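
The participation figures above can be checked with a little arithmetic. The sketch below simply recomputes the totals from the EASL and FASL counts quoted in the text and derives the increase from the first session to the second. The names `sessions`, `total_tested`, and `growth_percent` are assumptions introduced for this illustration; they are not part of the testing program.

```python
# Student counts quoted in the text for the two spring testing sessions.
sessions = {
    1976: {"EASL": 1386, "FASL": 1080},
    1977: {"EASL": 1927, "FASL": 1490},
}


def total_tested(year: int) -> int:
    """Total number of students interviewed in the given spring session."""
    return sum(sessions[year].values())


def growth_percent(first: int, second: int) -> float:
    """Percentage increase in students tested between two sessions."""
    before = total_tested(first)
    after = total_tested(second)
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    for year in sorted(sessions):
        print(year, total_tested(year))
    print(f"Increase from 1976 to 1977: {growth_percent(1976, 1977):.1f}%")
```

The recomputed totals agree with the 2,466 and 3,417 reported above, and the increase works out to a little under 40 percent.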

The schools taking advantage of the service are contacted by the 
interviewers assigned to them; arrangements are made regarding, for 
example, exact dates of the testing, available space needed, and materials 
to be used (tapes and tape recorders). 

Teachers also have to be given sufficient lead time to reacquaint 
themselves with the interview technique. For this purpose, each teacher 
receives a box of the practice tapes plus the manual containing the 
discussion of the practice interviews and the description of the language 
interview program. Some teachers meet together to play the previous 
interviews, to score them, and to discuss and agree upon the general 
method of conducting the interview. The others review by themselves. 
This is done to ensure a reasonable degree of uniformity. 

In order to release qualified teachers to do the interviews, the 
[illegible] cost of substitute teachers. Sufficient time is 
also needed for the teachers to prepare work for their students and 
substitute teachers. Most school districts do not want their teachers 
[illegible] to each interviewer to decide 
whether he or she will evaluate for five consecutive days or intersperse 
the days. 

Instructions to Principals and Students 

In schools where there are classified interviewers, students meet 
collectively before the interviews begin and the interviewer answers 
any questions and tries to put the pupils at ease. Principals are also 
made aware of the needs of the interviewers prior to their arrival in 
the schools. Requirements include rooms away from traffic noise, 
interviewing cards filled out beforehand, and good sound equipment. 

Taping Oral Interviews 

Each interview is taped individually with tapes provided by the 
department. (Tape recorders are also provided if the schools and teachers 
do not have any.) After the interviews, the tapes are kept at the 
Department like any other departmental grades. They are used for 
contin[illegible] prospective employers or 
universities if the students concerned are willing to release them. 

[illegible] that the length of an interview depends on 
the student's ability to communicate. A poor student might be interviewed 
for twelve or fifteen minutes, while a good student might be interviewed 
[illegible] teachers take five or ten minutes 
after each interview to finalize their ratings; others give a tentative 
score and take the tapes home to listen to again before giving their final 
ratings. 

The candidates tested were the students in grades 11 and 12 in all 
the courses: academic, industrial, home economics, and commercial. Each 
teacher interviewed about fifteen students a day. The smaller high 
schools completed the interviewing in a week or less; it took somewhat 
longer in the larger high schools. The interviewers made sure that 
everyone was tested as they did not want any students to feel left out of 
the activity. 

Student and Teacher Reaction 

The results of the first two evaluation sessions were very positive. 
As with all exam results, some students were pleased and others were not. 
Many who were not pleased with their performance and who were returning to 
school had a goal to work for. If they scored a 1 their aim was to reach 
1+ or 2; if they scored a 2+ their aim was to reach 3 or 3+. 

The results were also a revelation for some teachers. They realized 
that the oral production of their classes was either good or bad and 
decided to do something about the mediocre performance of some of their 
students. 

In many of the large high schools, the department heads and the 
teachers concerned made detailed studies of the results. If, for example, 
35 percent of the students had scored 1+, the objective of the English 
department for the following year was to try to raise the level to 2 or 
2+. 

As these tests are province-wide, each school knows where it stands 
on a provincial basis. So another incentive for schools is to raise 
their percentile ranks. 

Conclusion 

As an interviewer and a trainer, I can state that both the training 
and the implementation of the oral interview process have had a very 
positive effect on second language teachers. Though we conduct interviews 
not directly related to our local curriculum with students other than our 
own, we are afforded an experience that is not available within our 
own classrooms. 

These tests are competency oriented and the vast majority of the 
students enrolled, limited or not in their speaking ability, realize that 
in order to be evaluated they need to talk. So talk they do. 

Personally, I find interviewing the highest-level student the most 
difficult, as one has to extensively draw out vocabulary, structure, 
grammar, and other aspects in order to accurately judge the level. But at 
the same time, these students are the most interesting to talk to as they 
are usually most well-read on a variety of topics and more ready to 
communicate. 

As far as nervousness is concerned, very few students have that 
problem. The students who are nervous are usually the very, very slow 
students and they will generally tell you when they enter the interview 
room, "I can't speak English but I understand everything." I do not think 
these students would do any better with a teacher they knew. 

I certainly think it is a great opportunity for our students to be 
able to find out how competent they are in a second language. For some 
students, these interviews might be their answer to a career they are 
dreaming about. For others, just to know they can communicate in a second 
language will make them more emotionally secure in a new job or in a new 
English community. Therefore, wider horizons are opened to our students. 



NEW BRUNSWICK 
SECOND LANGUAGE 
TESTING PROGRAM 

I 

/ ■ i)iS(RiPii()\ 

" ■■ !i;;...i[Mn luriis: 

'f'*' • Spun:' ' 1' f;'p|ja' Ifli- 

VM'iiil i .iiiL'd.ii^r I'u^pwiw \\\\ Ih' jlIihiii- 
i'l^T^'il M\"::(!. LifiLiiiji:;' sinilnih in^pjriici' 

I ' iii-i^*." !i p'hMMi- !<ir \,iri(nis iislriiLtionjI 

flu- V,Nik! I .itii!iiji:c K'stinj; Proiiniiii 
li'^i* ^^il! ii'i! !u' \iH'>!"nii .iriy |\ir(iail;ir iiKiniial 
!^r:^K[v^': ■if KulriJJiiin ..llii; UMs will focus on 
' ;n'''livr;ik\ ,:ifh,'scj-iulLin>:iiJi:: 

lilt' '''^J-' -Alii ...u'.' !tK' [iii:r kisi. ^iLill ;iaMs ol 

■■^:i;m;":^J ^'^:lhlL'•:!sll]i:KKll:s|(), 
' ' ' ' .III.., ^[iLMKiiii: hi* ex.in;. 

"'^ ■' ijjmii.iL';' i^rjii.iL'n.A iniiTVicws, 

r 

I' ' :h.t: itilcrvicws he 

'-■•-I'^'^i !liv Mpcs n-'jinal ji Dcpardiinil 
I ■! pt'iMil :>[ ilira' yc.ifN. Seniors 
'A'.tiili|!H":'.i\;ii:!t-,pp,,ri'in,i\ |o rcqiii's! |[ki1 Ilk" ' 
^■'■■"'■iii"' "I 'iiO'i 'iu;i ii!i;'r\it'>A he u'nl lo ,i 
pi''S'-H\'.' r'!p|r.w:r .i; : • p"si-M\'ii[ular\ 

■''I''' '' ■^ll' '' ' "ii^lih.'npL' .iJ.'lllsslol) 

I 'i; jnil pro- 

' ■ ■ .l''V'!:);i^'i! vlnriiis: :he in[ervK'Ws 
• V^'^ '-i Jiirini: (lie siiiiiiiicr ot 

Q' y.. .!r ':■!>■ 'i^:!!t v...ic„.,in (lie Senior llieh 
ERJC ^-l>ii'H"::e:il lesfine PIo■l^r.lfI!. 



PAPER-AND-PENCIL TESTS 

Multiple-choice tests will be used to test reading 
and writing. Fifty minutes will be allowed to 
complete each test. 



THE READING TEST 

The reading test will contain two types of questions: 
(a) vocabulary in context and (b) reading 
comprehension, based on a variety of passages 
selected by New Brunswick teachers. Passages were 
selected and questions devised to cover a wide 
array of difficulties and content areas. The reading 
test will contain hO questions. 



THE WRITING TEST 

The writing test will be an indirect measure of the 
writing skill. It will test the ability to distinguish 
among structures usually considered important in 
writing the second language and to select those 
appropriate for a given context. There will be 
three types of questions: (a) usage, (b) sentence 
correction, and (c) sentence completion. These 
questions will cover a variety of grammatical and 
stylistic problems and vary in difficulty. The 
writing test will contain HO questions. 



LANGUAGE PROFICIENCY 
INTERVIEWS 

Language proficiency interviews will be conducted 
under standardised conditions by New Brunswick 
second language teachers who have been trained as 
second language interviewers through an in-service 
educational program implemented by the New 
Brunswick Department of Education and Educational 
Testing Service. Only those students in 
second language courses at grades 11 and 12 will 
be interviewed. Each language proficiency interview 
will give the student an opportunity to 
demonstrate, in a realistic conversational situation, 
the extent of his spoken mastery of the 
second language, as well as his [illegible] the 
spoken language. The specific content of 
the interviews will not be predetermined. [illegible] 
for the interview, beyond engaging in similar 
conversational types of experiences. The following 
areas of proficiency will be evaluated: pronunciation, 
grammatical accuracy, vocabulary, fluency, 
and listening comprehension. A scale comprised of 
competency levels within each area of language 
proficiency will be employed. These scores will be 
tabulated for each student and summed according 
to a predetermined weighting. The sum will then 
be converted to a five-level overall language 
proficiency score. 
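
The scoring procedure described above (area scores, a predetermined weighting, and conversion of the weighted sum to an overall level) can be sketched in a few lines. The brochure does not give the actual weights or conversion table, so the values of `WEIGHTS` and `LEVEL_CUTOFFS` below are purely hypothetical, as is the assumption that each area is scored on a scale of 1 to 6. The sketch shows only the shape of the calculation.

```python
# Hypothetical weights for the five areas named in the brochure.
# These are illustrative values, NOT the program's actual weighting.
WEIGHTS = {
    "pronunciation": 1,
    "grammatical_accuracy": 3,
    "vocabulary": 2,
    "fluency": 2,
    "listening_comprehension": 2,
}

# Hypothetical conversion table: minimum weighted sum for each overall level,
# listed from the highest level down. Assumes area scores from 1 to 6.
LEVEL_CUTOFFS = [(5, 55), (4, 45), (3, 35), (2, 25), (1, 15)]


def weighted_sum(area_scores: dict) -> int:
    """Sum the area scores after multiplying each by its weight."""
    missing = set(WEIGHTS) - set(area_scores)
    if missing:
        raise ValueError(f"missing area scores: {sorted(missing)}")
    return sum(WEIGHTS[area] * area_scores[area] for area in WEIGHTS)


def overall_level(area_scores: dict) -> int:
    """Convert the weighted sum to an overall level (0 means below Level 1)."""
    total = weighted_sum(area_scores)
    for level, minimum in LEVEL_CUTOFFS:
        if total >= minimum:
            return level
    return 0


if __name__ == "__main__":
    student = {
        "pronunciation": 3,
        "grammatical_accuracy": 4,
        "vocabulary": 4,
        "fluency": 3,
        "listening_comprehension": 5,
    }
    print(weighted_sum(student), overall_level(student))
```

With these made-up numbers, a student scoring 4 in every area has a weighted sum of 40 and falls at Level 3; different weights or cut-offs would of course give different results.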

OVERALL LEVELS 
OF LANGUAGE PROFICIENCY 

Level 1: Able to satisfy travel needs and minimum 
         courtesy requirements. 

Level 2: Able to meet basic social demands and to 
         satisfy simple needs related to school and 
         work. 

Level 3: Able to speak the language with sufficient 
         structural accuracy and vocabulary to 
         participate effectively in most formal and 
         informal conversations on practical and 
         social topics. 

Level 4: Able to use the language fluently and 
         accurately on all levels normally pertinent 
         to the needs of all formal and informal 
         conversations on practical, social, and 
         work-related topics. 

Level 5: Speaking proficiency equivalent to that 
         of an educated native speaker. 



NOUVfAU-BRUNSWICK 
PROGRAMME DES TESTS 
DELANCUESSECONDES 

DISl kll'llON (;1\1RA1,1- 

■n'i . •• il'i \|ii:L.I;!i' I. : I iiih J I.I 

■!( J:'.t\ 'u-, I'.v.iiJ.i:'-, iL' f,i I'inUIM' II 
xl I i:-t. |t;ii|| il.ifis ■ -/^ ills jTli. 

m:'' I'M- '•■•I'Mr.iiri .hri^ ,v' ..nlri ik's ph>' 
I'li'i:::.:'. 1/ j.ini'ik' -.iMiiiiL' ID^', ll^'a'l IJ^' 
' n"iiw',t;i pfni'ijiiiiitL' ivinplja'M k's 

'^'m;::;:!'. !;' Km , J'; 1 nii-iuA Si,\iiiiijc. hTj 
!«li!t.i'' ' ■ i:t ilii th; vlLt»|iii.' 

IM:-- I v/;*.l^ 'I '.If l'iJ;'N M\iMllll'S JlS 

\! - I ;H ':;ii'f;' ■ ,||'.:; , [tpHjijimiiL''. 

>l '1 'I:-' Ir llicilli' |)fn- 

I- I: l'l :'M[lliIU' .Ifs U'sis ill' 

I ■ ■ ^)^ Kim' ^iK (iLilliid 

I I 'li tin ,'lli;s i|n[liK' I 

' :i_-^':'Ti'i'. 'ia' .1,111, Li Lmijik' 

i ■ ■ I 1 ' lit' 

.1 ' '. . ■ 1' I' I • . .|;ll'lk■^ Hi'iK's 

" ■ ■ '■ , I 111; :\ 

■ 11^* ,t i 

■ ■ ' ' ■ " ' ''\'' W I !'l'\P[i'ssi():i 
_ ' !■ 'M!!:.'.. '■■ . ■ J '■:i[[fUk's ill' 

li ■[ : ,:-!;• > 
I' ;^[>",U I'' ! 'li';. : I'll! 1 1'\ IK'S ilf LimnR's 
"-avJ ■ I'l - : n'l.l.r■^ jvmljlil niu' 

I "i''^^ ^ h i >' ifilii'^* ,n;fjiu!l Li Ji' 
'h'liii:;.!--: ■ ji-.in'iii^'iii ,k' Inir nilri'MU' 
'i<'h;!ivi.'." 1 in ''iii;>!.>;,t'iii pnf-nn'l un j 1 an!-; 

ih' ;cMl[VlCIM' ilOli'IlllllL'S lofs (k'S ClltrCVIlOS 

.i'[Mif .>>iiiiniiiiii|iK'> ulli.kikMiK'ni |U'ii(Liiil 
ik' .iiiiiiv I'fi niniu' H'fiips k's iiotvs 
■ ihinm^ il.iiis I-; ,.h|p; ill) i'ruiir.iiiiiiic (k's U'sis lic 

Rl'lhk'IIH'Ill S,..'.||IC S^MKllLlI!;', >'l\\k' 



LI S TI;STS hCRITS 

P.nif lj ialiiro cr L rcdj^nin, on iiliiisaa ik's k-sis 
] itiiillipks. Us t'kVvs di^pDserofit dc 
LiiujiutiU' niiniiicspuur fiiiirdijquolcsl 



IISI, 1)L LHCrURh 




li' ilo k'i'lurc ioniprt'nilra,dos quosiiDiis di' 
di'iix p:ms' \\\\wM\m on amiaic d 
IbU'iMiiprclK'Hsiun do tovlos dioisis ^ar dos 
i'iisoiiinjn[> dii Nniivoau-BrumwK'L l.cs IoMov d 
k'S qiicsimns s'y r:jppt»rianl ropriviKonmi phi- 
sioiirs iiiviMiix do dilliaillo. I.i' lost do kvliiro 
ainiioridM(t()i|uos[i()n,s 



Tl-ST l)H Rl-:i)A('liON 




lo losi do raholioii sora uno mosuri' indircoto do 
l;i Qpjoilo dos I'levos d;ins lj rodaiiion. II 
o\jnitnorj lour luhikif a disiinijiior parnii Ics 
sirucliiros i^enorjloinor,* oons:doroo.s iiiiporljnios 
dans 1.1 rodaiHoii do l;i langiic socondL' oi i los 
ohiiisir solon dos oontcxtos donnos. II y aiir;i trms 
f gonros Jo ipjosiion.s (alusajio. iMoorrtvlion do 
phrases o( U'l phrjsos a ooniplolor. Ccs qiioslions 
jiimni pliiMoiirs nivoanx do dilTtailloot porloronl 
siir uno variclo do problonios graiiiinalkaax oi 
siylisiujiios. [,o losi do rodaotion oonliondra HO 
'|iiosiii)ns, 



l.l'..S l;NTP,liVl!|:.S 1)1- 

coMPiniNd LiNgiiisiiQUi-; ■ 

Los onlroviios do lonipoionoo Inijiiiisliqiio soroiii 
dunnoos. dans dos CDiidilioiis standardisoos. par dos 
ofisoignanls dii N«)iivoaii-Briiii.suivk qui inil olo 
fornics jDiiinio o\;tinini'loiirs do !:ini!iio.s sooondos 
all oi'iirs d'lin pru;;rj.nino do fnrnialton olahli par 
lo Minisloro do iTduLaiHin ol I'diioalinnal Tosiini; 
Sorbite, Soids k'S olovos Jos otuirs do lan|:iios 
sooondos dos 1 1^' ol I y aiiiioos aiiriMiI dos onlro- 
vuon do Lingiio dans lo I'aJro dii PtDgrannno dos 
Icsis do laipoN Sootifulos, (liailiio oiilioMio Jo 
oonipoiofK'o d.iih Li liniiuo dmiiiora a foiou' 
l\iooasi{)ii do doiiionUor. iLins uno amvorsalntii 
r'A'lk ol iialiiriilo. lo {Um do Min hahdoioa parloi 
Li hiiiiiiio soi'oiido ol a in niniproiidro rovprossmn 
'>r.do 1 0 mlm proas dos onlrouics no sora '\is 
prodoiorniino. // wu: Jifiu- ifmlili ijiu /n llois w 
p>!'rJ\'Kl" juiur k'\ ail)c\i{i\, s\ 00 il'osi do 
p.ii(ioipor a dos iDnvorsaiions nMuhlaNos los 
diiniainos do I'^iinpflonu' siiivanis Noriiiil o\V 
iiiinoN, pruniinoialiiin. prooisum graiiuitalioalo. 
u.L'abulaiLO. laoililo d o\prosMoii ol ooiiiprolionsum 
aiidilno. I'Ho ooliollo \k ooinpolotioo sora ulilisoo 
daih chai[Uo dtunaino do oninpoionoo liiiiiuhliqiio. 
los lUiU'N soronl oiahlios pnur olia(|iio olovo ol 
addiiiiiiiiioos sokni un baroniodo oiiolTkionis pro- 
doioniiiiio, Lo dUal sora onsnilo C(Mivorii on uno 
nitio ^onoralo CiirroNpundanl a un dos oniq nivoaii\. 
dooniipolonoooniinioK'NOi-apros. 

MVLAl^X CIM kACX |)|- ('{niPI-ILSd 

Nivoau 1: Poul salhbiro aii\ ksoins siinplos dii 
vo\a|ic ol aii\ o\i|:oiia's niiinnia tk- la 

uUI,'lo|slO 

\iUMli J' IVui salLslaiio aii\ o\tfOlK\s snoialosdo 
baso 01 aii\ ivsiHiis snnpk'N sc up- 
porlanl.llWok'olau travail, . 

Niveau } IViil park'r la lanijno avoo sulTisjiiiiionl 
do proiisiiin siruoliiralo oi loMquo pmii 
p,inioipor avoi suooos ilaiis la pkipari 
dos L'orivorsalmns olfioiollos oii nnli- 
nairos siir do.s siijois pralKjiics oi: 
^i)oiaii\. 

N|voaii4 Poul ulilisor l:i ijngiio oourainnionl ol 

avoo pri'oision a (nus h mm dos 
■ oofivorsalums auiraiKos ol spcoialisoos 

dans lo di)ni;iiiio praliqiio on sociai. ou 

iDUolianI ail (ravail. 
VivodU V SVxprinio avoo iirio lauLilo ogalo a oolk' 

d'iino porvMino inslriiil.!. nalivo do la 

laniiiio. 



USING THE FSI INTERVIEW 
AS A DIAGNOSTIC EVALUATION INSTRUMENT 



Stephen L. Graham 
Brigham Young University 



The Language Training Mission 

The Language Training Mission (LTM) is located in Provo, Utah, 
adjacent to the Brigham Young University (BYU) campus. It was established 
to provide intensive language and cultural training for missionaries 
of the Church of Jesus Christ of Latter-day Saints (Mormon) who serve 
voluntary, two-year missions in many countries of the world. 

Instruction began at the LTM in 1961 in Spanish and since that time 
has expanded to include Afrikaans, Cantonese, Danish, Dutch, Finnish, 
Flemish, French, German, Icelandic, Indonesian, Italian, Japanese, Korean, 
Mandarin, Navajo, Norwegian, Persian, Portuguese, Samoan, Serbo-Croatian, 
Swedish, Tahitian, Thai, and several Indian languages spoken in Latin 
America: Aymara, Cakchiquel, Guarani, Quechua, Quiche, and Quichua. 

Five to six thousand missionaries are trained annually at the LTM in 
the languages mentioned above. The instructional staff is composed almost 
entirely of students at the university who are working their way through 
college. They are either native speakers of the languages or returned 
missionaries who have recently completed their missions and are at BYU 
pursuing their education. The number of language instructors at certain 
times during the year reaches as high as 300. Also included on the staff 
are 75 to 80 certified testers who conduct FSI interviews on a regular 
basis. 

With the exception of approximately 100 missionaries a year who 
receive additional training, the missionaries learn one language and 
receive cultural training in an eight-week period of time. The 
missionaries are housed at the LTM and are required to speak their language for 
most activities during the day. This provides an ideal situation for 
total immersion in the language. 



FSI Interview Adopted as Evaluation Instrument at LTM 

Early in the spring of 1975 the Foreign Service Institute (FSI) 
interview was adopted as a major evaluation instrument at the LTM to help 
determine the overall language proficiency of the missionaries going 
through the program. There were three main reasons for its adoption: (1) 
the FSI interview is a well-designed, well-respected instrument, and 
provides a means of comparing results in oral language proficiency with 
other language institutions; (2) it is relatively simple to administer 
across different languages and, with periodic in-service workshops, 
quality control can be maintained within and across languages; (3) the 
"interview setting" is ideal for giving immediate, individually tailored 
feedback to the person being interviewed. 

Protase E. Woodford of Educational Testing Service (ETS) conducted 
the initial training for the first team of Spanish testers early in 
1975, and by mid-January 1977 certified testers had been trained in 
the twenty-one languages being taught at that time. Regular seminars 
for retraining and in-service workshops have since continued, including 
a two-day seminar in August 1977 that was given by John L. D. Clark 
of ETS. 

Upon arrival at the LTM, missionaries who have had prior experience 
in their target language receive an "entering FSI interview." All 
missionaries, without exception, receive a "departing FSI interview" at the 
conclusion of their LTM stay. Those who desire interim interviews for 
diagnostic purposes have this option available to them at any time during 
their stay. Scores are not recorded for the interim interviews; the 
emphasis is on giving useful feedback. 



FSI Tester Training at the LTM 

The training of FSI testers at the LTM is conducted in three seg- 
ments: (1) acquiring rating skills, (2) acquiring interviewing skills, 
and (3) in-service and retraining to maintain those skills. 

Rater training is provided through a self-instructional package 
entitled "Oral Language Proficiency Test Training Manual" (Part A), 
prepared by the LTM. The manual is accompanied by several sets of practice 
tapes (prerecorded and prerated FSI interviews) and a set of certification 
tapes. The trainee checks out the materials and works through them at a 
comfortable rate for him. The practice tapes give him an opportunity to 
practice his rating skills by assigning ratings to actual prerecorded 
interviews and then comparing his ratings with those of experienced 
testers. 

To move ahead into the training program for interviewing skills, the 
trainee must correctly assign FSI ratings for the prerecorded interviews 
of the certification tapes. 

Interview training is provided on an individual basis as well. Each 
trainee works in an apprentice-type situation where he receives personal, 
on-the-job training from an experienced tester. He begins by watching 
interviews that have been videotaped and by observing live interviews 
conducted by the experienced tester. The trainee then begins 
participating in actual interviews until he feels confident in conducting an 
effective interview on his own. 

The emphasis of the interview training is to ensure that the tester 
provides a comfortable atmosphere in which the missionary is able to 
perform at his maximum capacity in the language. 

In-service workshops are conducted every two months to provide 
follow-up training and remedial help where needed in both rating and 
interviewing skills. Activities of the workshops consist of conducting 
actual interviews on the spot and rating interviews that have been 
prerecorded on audio and video cassettes. Ratings are assigned independently 
by each tester and the results are then discussed as a group. Testing 
teams representing all languages taught at the LTM are present at the 
workshops. 

English is used for all initial training and workshop sessions. This 
does have some disadvantages in that the majority of testers are not 
native English speakers, but it helps maintain quality control across 
languages. Using English also helps keep the focus of the workshops on 
rating a person's ability to perform certain tasks in the language and 
avoids the myriad "linguistic" concerns that sometimes are raised when 
dealing with so many different languages. 



In-House Evaluation of FSI Testing Program at LTM 

At the close of 1977 (the first full year of FSI testing in all 
languages taught at the LTM), the administrative staff conducted an 
informal, in-house evaluation of the FSI testing program. This was to 
determine how well the program was fulfilling the three main purposes for 
which it was adopted. At the conclusion of the evaluation, the staff was 
encouraged by the quality and consistency of the testing results. Concern 
was expressed, however, about its usefulness in providing helpful feedback 
to the missionaries. A summary of the evaluation results follows: 

During 1977, 6,193 FSI interviews were conducted in twenty-four 
languages. This number includes both "entering" and "departing" inter- 
views. Of a randomly selected 763 interviews conducted in French, 
German, Japanese, and Spanish between the months of January and June 1977, 
there were only 156 discrepancies between independent ratings assigned 
by the interviewer and the rater before consultation. Of those 156 
discrepancies, 155 were no larger than a "plus." In other words, LTM FSI 
testers in these four languages agreed on the exact ratings 92.7 percent 
of the time without consulting each other. In the few cases where there 
were disagreements, the difference was rarely more than a "plus." 
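
A disagreement of "a plus" can be made concrete by treating the FSI ratings as an ordered scale and counting steps between two ratings. The sketch below assumes the usual arrangement of the scale from 0 to 5 with a "plus" rating between each pair of whole levels; the names `FSI_SCALE`, `steps_apart`, and `within_a_plus` are introduced here only for illustration.

```python
# Ordered FSI ratings: whole levels 0 through 5 with intermediate "plus" ratings.
FSI_SCALE = ["0", "0+", "1", "1+", "2", "2+", "3", "3+", "4", "4+", "5"]


def steps_apart(rating_a: str, rating_b: str) -> int:
    """Number of scale steps separating two ratings (one step = one 'plus')."""
    return abs(FSI_SCALE.index(rating_a) - FSI_SCALE.index(rating_b))


def within_a_plus(rating_a: str, rating_b: str) -> bool:
    """True if two independent ratings match or differ by no more than a 'plus'."""
    return steps_apart(rating_a, rating_b) <= 1


if __name__ == "__main__":
    print(steps_apart("2", "2+"))    # adjacent ratings
    print(within_a_plus("2+", "3"))  # interviewer and rater one step apart
    print(within_a_plus("1", "2"))   # a full level apart
```

Under this reading, an interviewer's 2 and a rater's 2+ count as a discrepancy of one "plus," while a 1 against a 2 would be a larger disagreement.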

The reliability of ratings across languages is a topic of every 
bimonthly workshop. As mentioned earlier, the majority of testers are 
not native English speakers. All training on this level, however, is 
conducted in English expressly for the purpose of ensuring consistency 
across languages. This is done by having all testers independently rate 
prerecorded interviews from a variety of sources. For example, ETS 
recordings are frequently used, along with those prerecorded by various 
teams represented at the workshops. 

As an example of tester performance during these regular workshops, 
the results of the most recent one, held in February of this year, are 
of interest: Of 102 independent ratings assigned during the workshop 
prior to consultation, 95 were in agreement, with only 7 ratings being 
either a "plus" too high or too low. Several of the interviews used for 
rating during the workshop were prerecorded on audio cassettes, others 
were recorded on video cassettes, and one interview was conducted live. 
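
The workshop figures just quoted translate directly into an exact-agreement rate. The short sketch below does the arithmetic; the function name `agreement_percent` and the variable names are assumptions made for the example.

```python
def agreement_percent(total_ratings: int, exact_matches: int) -> float:
    """Percentage of independent ratings that matched exactly."""
    if total_ratings <= 0:
        raise ValueError("total_ratings must be positive")
    if not 0 <= exact_matches <= total_ratings:
        raise ValueError("exact_matches must lie between 0 and total_ratings")
    return 100.0 * exact_matches / total_ratings


# Figures from the February workshop described in the text.
workshop_total = 102
workshop_exact = 95
workshop_off_by_plus = workshop_total - workshop_exact  # the 7 remaining ratings

if __name__ == "__main__":
    rate = agreement_percent(workshop_total, workshop_exact)
    print(f"Exact agreement: {rate:.1f}%")
    print(f"Ratings off by a plus: {workshop_off_by_plus}")
```

On these figures the exact-agreement rate is about 93 percent, and every one of the remaining ratings was within a "plus" of the agreed rating.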






The results of the evaluation up to this point indicated to the 
administrative staff that the general operation of the FSI testing program 
was improving both within and across languages. They also showed that the 
initial and in-service training programs for testers had become systematic 
and quite effective. 

In addition to having a smoothly functioning FSI testing program with 
adequate training for personnel and reliable ratings, one of the goals 
of the administrative staff is to provide missionaries with as much 
diagnostic help as possible during their LTM stay. This should enable 
them to increase their language proficiency significantly before leaving 
for the countries to which they are assigned. 

During February 1978, feedback was elicited from language instruc- 
tors, testers, and missionaries to determine the general feeling about how 
much diagnostic help was actually being given. Three weaknesses were 
consistently mentioned and confirmed by observing actual interviews. 
These weaknesses were: 

1. Lack of sufficient time to follow up on deficiencies. (Most of 
the interviews are given to missionaries three or four days prior to their 
departure for the assigned countries.) 

2. Lack of a systematic procedure for the tester to organize the 
feedback in a usable format for the missionary. 

3. Lack of a systematic procedure for getting the feedback back into 
the instructional program and ensuring that problems are remedied as well 
as diagnosed. 



Procedures for Providing Systematic Diagnostic Feedback

In an effort to facilitate the flow of useful, systematic feedback 
both to the individual missionary and into the instructional program 
itself, the following changes and modifications are proposed: 

1. The FSI "entering" and "departing" interviews will no longer
be conducted for every missionary. They will be conducted, rather,
on a random selection basis to provide the administrative staff with a
continual flow of statistical data for purposes of evaluation.

2. Each missionary will receive an interim diagnostic FSI interview 
during the third and sixth weeks of his stay at the LTM. These interviews 
will be conducted in the same manner as the regular FSI interview, except 
that diagnostic feedback will be given to the missionaries in lieu of 
FSI ratings. 






3. Testers will be provided with a diagnostic feedback checklist
specific to their language. This sheet will be used to record patterns of
deficiencies in a missionary's speech during the interview. The form
will be prepared in triplicate. At the conclusion of the interview one
copy will be given to the missionary for his own personal reference,
one copy will be sent to the instructional staff, and one will be retained
in the testing center.

This form will provide a means for the instructional staff to watch
for high-frequency items indicating specific areas of deficiency unique to
that particular language. Mini-classes will then be conducted during the
personal study time of the missionaries, and the most common errors in
grammar principles, vocabulary, comprehension, fluency, and pronunciation
will be treated on an individual and a group basis. (An example of
the French diagnostic feedback sheet is included as Appendix A.)

Conclusions 

The administrative staff feels these modifications in procedures 
will greatly enhance the usefulness of the FSI interview in a practical 
way without changing the test itself or the purposes for which it was 
designed. It is important to the Language Training Mission to be able 
to compare results in oral language proficiency with other language 
institutions. 

It is expected that the diagnostic feedback sheet will need periodic
revision and modification with respect to both scope and layout. These
changes will be made as needed over the next few months in a trial run.
The intent of the recommended changes, however, is to take full
advantage of the "interview setting" for giving personal, oral feedback
to individuals. The emphasis on "oral evaluation" is especially
important at the LTM, where the emphasis in language training is on
acquiring speaking and listening comprehension skills.

The FSI interview testing program (both diagnostic and traditional),
accompanied by the traditional written testing program, will provide
the LTM with useful formative and summative evaluation data. Both are
essential to ensure individual improvement for the missionaries and to
upgrade and modify instructional programs and materials.



Appendix A

FSI DIAGNOSTIC FEEDBACK - FRENCH

DATE ____________          INTERVIEWER ____________

... INTERVIEWED HAS DIFFICULTY. MAKE ADDITIONAL COMMENTS AND
OBSERVATIONS IN THE SPACE PROVIDED. THIS EVALUATION WILL BE USED IN
GIVING REMEDIAL HELP TO THE PERSON BEING INTERVIEWED TO INCREASE HIS
PROFICIENCY.

___ speaks in infinitive or with no verbs at all
___ preposition à
___ preposition de
___ à + le, les = au, aux
___ de + le, les = du, des
___ avoir and être as auxiliaries
___ direct object pronouns le, la, les
___ indirect object pronouns lui, leur
___ ... and depuis used with time
___ ... and ... used with time
___ c'est and il est
___ definite article as in la charité
___ ne...jamais, ne...personne, etc.
___ prepositions of place en and au
___ de as in "Je n'ai pas de..."
___ adverbs vs. adjectives as with correct and correctement
___ reverts to English word order

___ connaître vs. savoir
___ parler vs. dire
___ bien vs. bon
___ peuple vs. gens
___ fois, temps, l'heure, moment
___ d'accord vs. "OK"
___ pour and pendant and time
___ changer, changer de, and changement
___ avant de vs. avant que ...
___ après avoir vs. après être ...
___ vous and tu vs. impersonal on
___ plus, très, and trop
___ mieux vs. meilleur

verb endings
___ present tense
___ past compound tense
___ future tense
___ imperfect tense
___ conditional tense
___ subjunctive mood
___ adjective agreement
___ agreement of past participle

COMMENTS:

Makes the following common errors:

___ attendre "pour"
___ chercher "pour"
___ "pour, sur"
___ binir "avec"
___ avoir besoin "pour"
___ "plus" mieux

COMMENTS:

___ ... expression when discussing his area of expertise.

___ Cannot describe objects, feelings, and situations when he does not
    know the specific word or expression.

LTM 6/26/73




FSI DIAGNOSTIC FEEDBACK - FRENCH (CONT.)

___ When the interviewer spoke at his normal speaking speed the
    interviewee had difficulty following him.

___ When the interviewer spoke on general topics other than those very
    familiar to the interviewee, the latter understood isolated words
    and expressions but generally did not understand the full context
    of ideas.

1. When the interviewer spoke on the topic of ____________,
   the interviewee ____________
2. When the interviewer spoke on the topic of ____________,
   the interviewee ____________
3. When the interviewer spoke on the topic of ____________,
   the interviewee ____________
4. When the interviewer used the word (or expression) ____________,
   the interviewee ____________
5. When the interviewer used the word (or expression) ____________,
   the interviewee ____________
6. When the interviewer used the word (or expression) ____________,
   the interviewee ____________

The items which are checked below describe the fluency of the
interviewee's language:

___ Has difficulty speaking at his own        ___ Speaks at natural speed
    natural speaking speed
___ Pauses are unnatural and illogically      ___ Pauses natural and logical
    placed
___ Speech is irritating and annoying to      ___ Speech not annoying
    listen to over long period of time
___ Phrases are broken and incomplete         ___ Phrases smooth and complete
___ Speaking generally requires a great       ___ Speech is effortless
    effort on the part of the interviewee

The person being interviewed has difficulty with the items below which
are checked:

___ ... as in ____________
___ ... as in ____________
___ ... as in ____________
___ ... as in ____________
___ ... as in ____________
___ ... (open) as in ____________
___ ... (closed) as in ____________
___ u as in ____________

Liaisons:
___ optional as in ____________
___ obligatory as in ____________
___ prohibited as in ____________

Nasals:
___ ... as in ____________
___ ... as in ____________
___ ... as in ____________
___ un as in ____________

___ "ss" between vowels: impression
___ "s" between vowel and consonant: enthousiasme
___ "e" as in se lever, revenir, levez
___ i before double consonant as in innocent
___ o before double consonant as in bonne, occuper

COMMENTS:



DIRECT TESTING OF SPEAKING SKILLS
IN A CRITERION-REFERENCED MODE¹

Robert B. Franco
Defense Language Institute

Background 

The Defense Language Institute (DLI) and its predecessor, the Army
Language School (ALS), have traditionally emphasized the development of
oral skills in their foreign language programs. Although in the past few
years other primary objectives, of a military-technical nature, have been
pursued, the main emphasis has remained on developing speakers of foreign
languages to an S-3 level of proficiency. Ironically, the speaking skills
have been the most elusive and difficult to measure with a satisfactory
degree of objectivity.



Historical Perspective

At DLI the search for an effective system of evaluation of speaking
skills can be traced back to the days of the Army Language School and
extends until the present time, but for the purposes of this paper, the
period will be divided into pre-1976 and post-1976 segments. In our
pre-1976 courses, the core of the lesson unit was a "basic dialog," charged
with presenting certain grammatical features within the context of a
high-frequency, authentic situation. Traditionally, the dialog was
introduced in class, then studied until "fully understood" and memorized
at home. The next day, the dialog was reviewed and enacted in the
classroom, as realistically as possible. A good imitation by the student of
the native model's pronunciation and fluency, an indication of a clear
understanding of what was being said, plus the native-like use of
important paralinguistic features, constituted the evaluation criteria.

The acceptability of the student's performance depended on the
powers of observation and the subjective appreciation of the instructor.
Furthermore, an acceptable performance in class was recognized as
sufficient proof of the student's capacity to perform effectively on the
job.

Cognizant of the subjectivity that permeated this method of
evaluating speaking skills, ALS/DLI instituted a less informal system, which
included weekly, monthly, and final oral examinations. The weekly tests
consisted of a series of questions based on the materials covered during
that week. These questions were read aloud by the instructor, who then
noted the accuracy and completeness of the student's responses. For the
monthly and final examinations, one or two bilingual conversations were
added in which the examinee played the impromptu role of interpreter.
Notes and tallies were kept, but the scoring was still based on a
subjective appreciation of the examinee's performance, even when an examiner
other than the classroom teacher was the scorer. As part of the system,
the oral score was computed with the scores of pencil-and-paper tests
given for other skills, and a composite of all test scores was then
computed with the average of the daily grades for the testing period.

¹The views of the author do not purport to reflect the position of the
Department of the Army or the Department of Defense.

Somehow, our good teaching survived our poor testing, at least within
our system. To illustrate, in 1973 we took a ten-year block of these
composite scores of approximately 1,000 Spanish basic course students and
compared the scores with those obtained by the same students on the
listening comprehension part of the Defense Language Proficiency Test.
To our surprise, a correlation of .91 was discovered, although the
correlation for other languages is about .60. This relieved us momentarily,
but of course did not validate our system.

In the late fifties and early sixties, our expectations were raised
by the development and refinement of the Foreign Service Institute (FSI)
"techniques for the testing of speaking proficiency," followed by
publication of the Modern Language Association (MLA) Cooperative Foreign Language
Tests and the Modern Language Association Proficiency Tests for Teachers
and Advanced Students. DLI examined the new instruments very carefully,
tried them out, and adopted their formats with the modifications required
by the nature of our student population and their special needs.

For the pre-1976 Spanish basic course, specifically, we adopted the
FSI model and used it, experimentally, as a proficiency, placement, and
achievement test. However, its full utilization was inhibited by two
factors: the limited scope of our basic course (with a final objective of
S-3) and the absence in the course design of interim objectives that would
have addressed the S-1 and S-2 levels chronologically and permitted
diagnostic use of the structured oral interview based on FSI techniques.

We found the MLA speaking tests were not as readily adaptable to the
Spanish basic course, mainly because the tests had a different content and
employed techniques with which our examinees were not as familiar. As
with the FSI interview, the internal structure of the course was also an
inhibiting factor, although this was later remedied in the new course
design. Features of the MLA model, nevertheless, were incorporated into
the "level tests" developed by DLI and Educational Testing Service.

The New Spanish Basic Course, Post-1976

In the mid-seventies, a new DLI Spanish basic course was designed
and developed under the growing influence of a criterion-referenced
instruction (CRI) approach, derived from the Interservice Procedures for
Instructional Systems Development (IPISD), a model produced by the Florida
State University under a joint interservice contract. Thus, a system
designed primarily for military instruction was transplanted into the
foreign language curriculum.






In addition to this CRI general orientation, the new design addressed
the sequential achievement of skill levels I and II as interim objectives,
keeping skill level III as the final objective of the basic course.
Schematic diagrams for the pre- and post-1976 course design are shown in
Appendix A.

Course Design 

The course consists of nine general modules and one enrichment/
remedial module, to be covered in no longer than twenty-seven weeks.
Modules 1, 2, and 3 address skill level I; modules 1 through 6, with
emphasis on 4, 5, and 6, address skill level II; and all nine modules,
with emphasis on 7, 8, and 9, aim at skill level III. The evaluation
track includes nine module tests and three level tests, with the level
3 test complemented by a comprehensive achievement test, the Defense
Language Proficiency Test (DLPT), and a structured oral interview, limited
to skill level III. In addition, each of the six lesson units in a module
contains a series of criterion checks for the evaluation of stated lesson
objectives, with emphasis on the communication frame to check speaking
ability. A separate track of criterion-referenced checks evaluates
listening comprehension skills.
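
The module-to-level mapping just described can be written down as a small data structure, which makes the cumulative design easy to inspect. The Python sketch below is only an illustration: the class and function names are assumptions introduced here, and nothing in it is drawn from DLI course documentation beyond the module numbers and the six lesson units per module stated above.

```python
from dataclasses import dataclass

LESSON_UNITS_PER_MODULE = 6  # "each of the six lesson units in a module"


@dataclass(frozen=True)
class SkillLevel:
    name: str
    modules: tuple    # general modules that address this level (cumulative)
    emphasis: tuple   # modules emphasized for this level


LEVELS = (
    SkillLevel("I", (1, 2, 3), (1, 2, 3)),
    SkillLevel("II", tuple(range(1, 7)), (4, 5, 6)),
    SkillLevel("III", tuple(range(1, 10)), (7, 8, 9)),
)


def levels_emphasizing(module: int) -> list:
    """Return the names of the skill levels that emphasize a given module."""
    return [level.name for level in LEVELS if module in level.emphasis]


def lesson_units(level: SkillLevel) -> int:
    """Count the lesson units in all modules that address a level."""
    return len(level.modules) * LESSON_UNITS_PER_MODULE


for level in LEVELS:
    print(level.name, level.modules, lesson_units(level))
```

Because the levels are cumulative, the 18 lesson units of modules 1 through 3 count toward all three levels, and level III draws on all 54 lesson units of the nine general modules.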

New Evaluation Design 

The field test of the materials indicated the need to consolidate 
the various types of tests into a comprehensive, criterion-referenced 
evaluation track. 

The new track combined the best features selected from each of the 
previous components. This selection was based primarily on student and 
faculty input that was, admittedly, personal and subjective. The result 
was a battery of partly norm-referenced and partly criterion-referenced 
tests called Comprehensive Hybrid Achievement Tests (CHATs). Our new 
technology, however, required a clearer CRI orientation, so we reexamined 
the objectives and the criteria, and adjusted the instruments. This 
produced the present Major Criterion-Referenced Tests (MCRTs): Anchor CRT 
1, Anchor CRT 2, and Final CRT, which evaluate the attainment of the 
objectives assigned to skill levels I, II, and III, respectively. Neither 
the module tests, the lesson unit quizzes, nor the listening comprehension 
CRTs were modified, but closer coordination was recommended of lesson 
objectives, communication frames, and the speaking MCRTs. 

The MCRTs test seven component skills independently. Speaking is
listed, arbitrarily, as number IV. The content outline for the complete
MCRT battery is shown in Appendix B.



The Speaking MCRTs 



Specifications. A complete set of specifications for the Spanish
Speaking MCRTs is included in Appendix C.

Format. The speaking test consists of a two-part oral interview
between an examinee and one specially trained native speaker of Spanish.
The first part of the interview is related to specific topical areas
about which the examinee has knowledge. Spoken Spanish responses by the
examinee are elicited by spoken Spanish questions or statements by the
interviewer and systematically based upon the list of topics.

The second part of the test is conducted in the same manner. Instead
of topics, role-playing situations are utilized to form the basis for the
examinee's responses. Both topical areas and role-playing scenarios are
printed in English in the test booklet that is given to the examinee at
the beginning of study for the modules to be tested. A separate booklet
is provided for the interviewer to provide the information necessary to
prepare, conduct, and score the interview.

During the study of the modules to be tested, the student is
encouraged to act out the scenarios pertinent to each lesson and to
be checked out by his or her instructors. In fact, the students
themselves have developed a check sheet for each role-playing situation and
concentrate their attention on those scenarios that are not specifically
covered in the communication frames of the lesson CRT.

Content. As stated earlier, the Spanish MCRTs parallel the
objectives and content of the basic course. Anchor CRT 1, for example,
addresses tasks derived from the definition of skill level I in speaking
that correspond to the speaking objectives of modules 1, 2, and 3, which
are the targets of the test.

To illustrate:

Level I objectives (S-1 tasks):

1. Use greetings and leave-taking expressions. Offer
apologies.

2. Make simple social introductions of self and
others.

3. Ask and tell time of day, day of week, date.

4. Order a "simple" meal.

And so forth.






Elements of task 4, for instance, have been assigned to lesson 7
as its speaking objective, within the format and criteria of effective
role-playing of restaurant scenarios. To verify the achievement of this
objective, after all enabling objectives have been satisfied, the student
is tested in the four role-playing situations of the communication frame,
which is the lesson's speaking test. Also, while working in the first
three modules of the course, the student prepares and is checked out on
the six role-playing situations included in Anchor CRT 1 for task 4.
Thus, when the test is formally administered, a passing score on any of
the six scenarios would satisfy the requirements of this task.

This close parallelism may constitute one of the best features of the 
Spanish MCRTs. 

Administration. The test is administered in the form of a structured
oral interview. The interviewer must be a native speaker of Spanish and
specially trained to use this technique. Though structured, each
interview is unique. For this reason, standardized alternate test forms
employed for measuring the other skills in the Spanish MCRT series are not
used in the speaking test.

Separate guides have been prepared for Anchor CRT 1, Anchor CRT 2,
and the final examination, each with examiner's and examinee's versions.

a. Examiner's guide. Each guide provides detailed information on
the procedures to be followed and supplies the topical and situational
information that give the examination its structured elements. It
is essential that interviewers administering these speaking tests be
thoroughly familiar with the contents of both the examiner's guide and
the examinee's guide.

b. Examinee's guide. Each examiner's guide has a companion
examinee's guide. The guide for the examinee provides procedural, topical,
and situational information and is given to the student when he or she
begins study of the modules with which each guide is associated. The
student is instructed to become familiar with the contents of the guide
and to bring the guide to the test site. Each guide also contains a
removable student rating sheet. Its use will be described in the section
about scoring.

c. Time allocation. Time allowed for administration of the speaking
tests is indicated in the examinee's guide, the examiner's guide, and in
table 1 of the administration and scoring manual prepared for the MCRTs,
as shown in Appendix D.

d. Observers. The Spanish MCRTs are designed for use in a
face-to-face, one examinee/one interviewer situation. The presence of an
independent scorer, an observer, or an interviewer trainee is permitted.
Any such third person present during the interview must remain silent
and unobtrusive.






e. Recording. Recording the oral interview is permitted. These
recordings may be used for independent scoring, training interviewers, or 
rating interviewee performance by another rater. Most reel-to-reel and 
cassette tape recorders have only a single microphone input jack. For 
this reason, the microphone must be carefully placed so both interviewer 
and examinee voices will be recorded. Preadjusting the equipment under 
actual test conditions is recommended. 

Scoring. The speaking tests may only be scored by trained scorers
who have expert knowledge of the Spanish language. Full details on
scoring the speaking tests are contained in the examiner's guides that
have been prepared for each Anchor speaking test and the final speaking
examination. Since no two interviews are conducted identically and
examinee responses can vary, the speaking test is not arranged in
standardized alternate forms. A separate rating scale has been prepared
for each Anchor test and for the final test. Appendix E shows the student
rating sheet for Anchor CRT 2. Similar sheets (with different rating
level weights and percentage conversion tables) have also been prepared
for Anchor CRT 1 and the final examination. While speaking is subject to
minimum acceptable performance standards, a special provision has been
added to these tests so that examinee performance can also be expressed as
a performance skill level.

a. Ratings. Performance ratings are used to derive skill points
from which the score is determined. The procedure is the same for the
Anchor tests and the final examination. A three-point rating scale is
applied to five linguistic categories in accordance with the statements of
performance criteria. The ratings based upon the examinee's performance
are not language skill levels, but points from which to derive a score.
It is this point score that can be converted to conventional language
skill levels, to percentage grades, or to pass/fail grades. A separate
rating sheet is provided for each speaking test to reflect slightly
different weights for certain linguistic categories. The procedure for
using the rating sheets is the same for all speaking tests.

b. Computation of Points and Score Conversions. The examiner is
required to use the following procedure:

1. Using the computation table at the top of the rating sheet, judge
the examinee's performance on each of the five linguistic categories,
determine the number of points derived by using the appropriate rating
column (1, 2, or 3), and enter that number of points in the space provided
under "Skill Points." Add the column of skill points. This produces the
examinee's point score.

2. The score-to-level conversion table is located at the lower
left-hand side of the rating sheet. Using the total number of points
scored, circle the appropriate level opposite that band of scores. Enter
the skill level attained in the space marked "Skill Level" at the bottom
of the page.






3. The score-to-percentage conversion table is located at the 
lower right-hand side of the rating sheet. Using the total number of 
points scored, circle the appropriate percentage score for points scored. 
Enter the percentage score attained in the space marked "Percentage 
Score" at the bottom of the page. 

4. Based upon the minimum acceptable performance standard for
speaking, check the "Pass" or "Fail" block at the bottom of the page.

The criteria for each linguistic category were adapted from the
definitions previously used at DLI, derived primarily from the FSI
interview materials. Performance criteria for Anchor CRT 2 are reproduced
in Appendix F.
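
The four-step procedure above can be sketched in a few lines of Python. Everything numeric in the sketch is a placeholder unless noted: the category names, the point table, the level bands above the 30-point floor, and the linear percentage conversion are hypothetical stand-ins for the tables printed on the DLI rating sheets, which are not reproduced in this paper. Only the three-point scale, the five categories, the 30 raw-score cut-off for S-1 on Anchor CRT 1, and the 70 percent minimum come from the text.

```python
from bisect import bisect_right

# HYPOTHETICAL point table: points for a rating of 1, 2, or 3 per category.
# The real category weights appear only on the DLI rating sheets.
POINT_TABLE = {
    "pronunciation": (2, 4, 6),
    "grammar": (6, 12, 18),
    "vocabulary": (4, 8, 12),
    "fluency": (4, 8, 12),
    "comprehension": (4, 8, 12),
}
# HYPOTHETICAL score bands, except the 30-point floor for S-1.
LEVEL_FLOORS = (0, 30, 50)
LEVEL_LABELS = ("below S-1", "S-1", "S-1+")
PASS_PERCENT = 70  # minimum acceptable performance cited in the paper
MAX_POINTS = sum(points[-1] for points in POINT_TABLE.values())


def score_interview(ratings: dict) -> dict:
    """Apply steps 1-4: skill points, level, percentage, pass/fail."""
    if set(ratings) != set(POINT_TABLE):
        raise ValueError("rate each of the five categories exactly once")
    skill_points = {}
    for category, rating in ratings.items():
        if rating not in (1, 2, 3):
            raise ValueError(f"rating for {category} must be 1, 2, or 3")
        skill_points[category] = POINT_TABLE[category][rating - 1]
    total = sum(skill_points.values())                               # step 1
    level = LEVEL_LABELS[bisect_right(LEVEL_FLOORS, total) - 1]      # step 2
    # Assumed linear conversion; the real sheet uses a printed table.
    percentage = round(100 * total / MAX_POINTS)                     # step 3
    return {
        "skill_points": skill_points,
        "point_score": total,
        "skill_level": level,
        "percentage": percentage,
        "passed": percentage >= PASS_PERCENT,                        # step 4
    }


result = score_interview({
    "pronunciation": 2, "grammar": 3, "vocabulary": 2,
    "fluency": 2, "comprehension": 2,
})
print(result["point_score"], result["skill_level"],
      result["percentage"], result["passed"])
```

Keeping the tables as data rather than logic mirrors the design described in the text, in which each test has its own rating sheet with different weights but the procedure itself never changes.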



Validation

The components of the Spanish MCRT battery were produced between
July 1976 and November 1977, and, on the assurance of subject matter
experts, these MCRTs are considered validated by DLI and are being
monitored to ensure that they continue to meet design criteria (the
concept of "internal validation" vs. "external validation"). By February
10, 1978, the tests had been administered to only 111 students, with the
following basic results:

MCRT 1      N = 50        Passed = 45      Failed = 5
MCRT 2      N = 46        Passed = 43      Failed = 3
           (N = 15*)      (13)             (2)
MCRT 3      N = 15*       Passed = 14      Failed = 1

*These students (Class 01LA24W0977) were not administered CRT 1,
because the test was not available when the class reached the S-1
level.
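
For readers who want the pass rates behind these counts, the short Python sketch below computes them from the passed and failed figures reported above. The dictionary layout and function name are illustrative assumptions; each N is taken as passed plus failed, which sums to the 111 administrations cited in the text.

```python
# Passed/failed counts as reported for the three speaking MCRTs.
RESULTS = {
    "MCRT 1": {"passed": 45, "failed": 5},
    "MCRT 2": {"passed": 43, "failed": 3},
    "MCRT 3": {"passed": 14, "failed": 1},
}


def pass_rate(passed: int, failed: int) -> float:
    """Fraction of administrations that met the minimum standard."""
    return passed / (passed + failed)


rates = {name: pass_rate(**counts) for name, counts in RESULTS.items()}
total_passed = sum(counts["passed"] for counts in RESULTS.values())
total_given = sum(counts["passed"] + counts["failed"] for counts in RESULTS.values())
overall = total_passed / total_given

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
print(f"All tests: {total_passed} of {total_given} passed ({overall:.1%})")
```

The rates fall between 90 and about 93.5 percent on each test, though with groups this small a single additional failure shifts a rate by several points.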



Admittedly, this is too small a sample to ensure utility for external
uses, but it is considered sufficient for DLI purposes. Furthermore, the
initial reaction from both examinees and examiners is encouraging.
Following are a few of the comments gathered to date about the test:

"It measures the functional competences stated as learning
objectives."

"Both the limited scope of each CRT and its use of
content-sensitive scenarios tend to guarantee a fuller exploration of the
stated objectives than is true of other tests used previously."

"The student is encouraged to be checked out by the instructor
on each of the interview topics and role-playing scenarios one by
one, and to use this informal appraisal of his or her performance
diagnostically for immediate remediation."



"Role-playing is preplanned, integrated into the course, and is
not a surprise at the time of the test."

"Because of scope limitations, no exploratory time is required, 
greatly reducing administration time, especially for MCRTs 
1 and 2." 



"Examiners must use the student rating sheets to assign S-ratings 
and other scores. Thus, 'experienced judgment' plays a lesser 
role, which tends to reduce the subjectivity of the scoring 
system." 



"The tests appear to have 'inherited' the validity of the FSI 
interview, and could perhaps surpass it." 

These opinions will be corroborated or disclaimed through our mediation
and monitoring procedures. Meanwhile, several test features have been
identified for critical evaluation, for example:

The 70 percent minimal acceptable performance cutoff. (This was
set by the user agencies, but the test developers feel it could
be raised, to better equate test performance with on-the-job
performance requirements.)

The number of role-playing scenarios and the procedures used
for the selection of those actually tested. (The procedure
could include the examiner's review of the examinee's record of
scenarios checked out, and of any specific job requirements
known.)

The "up-to-date" situational orientation of the interview and
the role-playing scenarios. (Specific changes in course
objectives dictated by changing conditions in the field will affect
test content.)



Conclusions



It has been apparent to the developers of the Spanish MCRTs that both
examiners and examinees approve of the speaking tests. We have observed
in the students an attitude of enthusiasm and a sincere desire to prepare
fully for the tests and to excel in their performance. There seems to be
no doubt as to the content validity of the tests. As for their predictive
validity, the criterion-referenced ambience in which the tests are used
and our informal observation of the initial results provide us with
encouragement. Nevertheless, in the absence of sufficient data, no final
conclusions can be made at this time on the overall efficacy of the DLI
Spanish speaking tests as criterion-referenced instruments. As we gather
data and develop supportive conclusions, we shall be happy to share them
with any interested persons.



Appendix A

SPANISH BASIC COURSE DESIGN - 1975

SEQUENCE      LEVEL I          LEVEL II         LEVEL III        Individual Needs
MODULES       1  2  3          4  5  6          7  8  9          10
EVALUATION    Module Tests     Module Tests     Module Tests     Oral Interview
              1 2 3            4 5 6            7 8 9
              LEV I            LEV II           LEV III
              LC CRT Checks; FINAL CRT



SPANISH BASIC COURSE DESIGN - 1976

SEQUENCE      LEVEL I          LEVEL II         LEVEL III        Individual Needs
MODULES       1  2  3          4  5  6          7  8  9          10
EVALUATION    Module Tests     Module Tests     Module Tests
              LC Checks        LC Checks        LC Checks
              CRT 1            CRT 2            FINAL CRT



Appendix B

SPANISH MAJOR CRTs

DESIGN

                                      CRT #1   CRT #2   FINAL                 C-R

I    LISTENING COMPREHENSION
       Conversations (3, 3, 3)          10       10       25    M/C Items
       Broadcasts (3, ..., 3)           10       10       25    M/C Items
       Total                          = 20       20       50    M/C Items     70%

II   READING COMPREHENSION
       Signs (3, 3, -)                   3        3        -    M/C Items
       Notices (3, 3, -)                 7        7        -    M/C Items
       Headlines (3, 3, -)               3        3        -    M/C Items
       Articles (2, 2, 6)                7        7       50    M/C Items
       Total                          = 20       20       50    M/C Items     70%

III  TRANSLATION                        15       20       50    Minutes
       Text (100, 150, 200 words)       20       30       40    Key Words     70%
       (Lexical Aids)                   15       30       45    Minutes

IV   SPEAKING                          S-1      S-2      S-3
       1- Interview/Conversation         5        5       10    Minutes       70%
       2- Role Playing (2, 3, 4 Sits.)  10       15       20    Minutes

V    WRITING
       1- Completion                    12       24       36    Items         70%
       2- Transformation                 6       12       18    Items         70%
       3- Composition                    1        2        3    Comps.        70%
                                        20       45       60    Minutes

VI   NUMBER TRANSCRIPTION
       1- Five 10-Number Series (3, 4, 5 digits)                              90%
       2- Ten In-Context Numbers (Card., Ord. & Fract.)                       90%

VII  GENERAL TRANSCRIPTION
       Conversations (3, 3, 3)
       Broadcasts (3, 3, 3)
                                        60       90      135    Minutes       87.5%



Appendix C

TARGET LANGUAGE CRITERION-REFERENCED TEST
ANCHOR CRT I SPECIFICATIONS
December 1976

Speaking

The speaking test is divided into two parts: Part 1, in a
direct conversation/interview format, and Part 2, in a
role-playing format.

1. Part 1/Stimulus and Task - Given not less than 15
oral questions sequenced into an informal conversation
covering at least 3 separate Basic Topics from those listed
in the Examiner's Guide, and presented orally by the interviewer,
the examinee will answer the questions orally, as completely
and fluently as possible.

2. Part 1/Conditions -

a. The conversation/interview will utilize not more
than 5 minutes of the 15 minutes allocated to the speaking
test.

b. No lexical aids are permitted.

c. Vocabulary and grammatical features used in the
stimulus must be limited to those covered in the course of
instruction for which the examinee is being measured.

3. Part 1/Criterion -

a. Scoring is accomplished by the interviewer by
keeping mental notes or casually noting on the Student
Rating Sheet the level of ability demonstrated by the
examinee on each sub-skill.

b. After both Part 1 and Part 2 have been completed,
the examiner combines his/her observations into one grade for
each ability and computes the raw score using the S-1
COMPUTATION TABLE. (The computation table and scoring
procedures are provided in the Examiner's Guide.)

c. No criterion is prescribed for Part 1, but a
30 raw-score cut-off (equivalent to an S-1 Level) is established
for the entire speaking test.

4. Part 2/Stimulus and Task - Given not less than three
role-playing scenarios selected as recommended in the Examiner's
Guide, the examinee will assume the roles indicated in the
scenarios and conduct them with the instructor as naturally
and fluently as possible.



5. Part 2/Conditions -

a. The three scenarios must be completed within 10
minutes.

b. The examinee is permitted to quickly read the
instructions for the scenario, but the use of lexical
aids is not permitted.

c. Vocabulary and grammatical features used in
the stimulus must be limited to those covered in the course
of instruction for which the examinee is being measured.

6. Part 2/Criterion -

a. Scoring is done as described in 3a and b above.

b. No criterion is prescribed for Part 2, but a 30
raw-score cut-off (equivalent to an S-1 Level) is established
for the entire speaking test.



TARGET LANGUAGE CRITERION-REFERENCED TEST
ANCHOR CRT II SPECIFICATIONS
December 1976

Speaking

The speaking test is divided into two parts: Part 1, in a
direct conversation/interview format, and Part 2, in a
role-playing format.

1. Part 1/Stimulus and Task - Given not less than 15
oral questions sequenced into an informal conversation
covering at least 3 separate Basic Topics from those listed
in the Examiner's Guide, and presented orally by the interviewer,
the examinee will answer the questions orally, as completely
and fluently as possible.

2. Part 1/Conditions -

a. The conversation/interview will utilize not more
than 5 minutes of the 20 minutes allocated to the speaking test.

b. No lexical aids are permitted.

c. Vocabulary and grammatical features used in the
stimulus must be limited to those covered in the course of
instruction for which the examinee is being measured.

3. Part 1/Criterion -

a. Scoring is accomplished by the interviewer by
keeping mental notes or casually noting on the Student Rating
Sheet the level of ability demonstrated by the examinee on
each sub-skill.

b. After both Part 1 and Part 2 have been completed,
the examiner combines his/her observations into one grade for
each ability and computes the raw score using the S-2
COMPUTATION TABLE. (The computation table and scoring
procedures are provided in the Examiner's Guide.)

c. No criterion is prescribed for Part 1, but a
45 raw-score cut-off (equivalent to an S-2 Level) is established
for the entire speaking test.

4. Part 2/Stimulus and Task - Given not less than four
role-playing scenarios selected as recommended in the Examiner's
Guide, the examinee will assume the roles indicated in the
scenarios and conduct them with the instructor as naturally
and fluently as possible.


5. Part 2/Conditions -

a. The four scenarios must be completed within
15 minutes.

b. The examinee is permitted to quickly read the
instructions for the scenario, but the use of lexical
aids is not permitted.

c. Vocabulary and grammatical features used in the
stimulus must be limited to those covered in the course of
instruction for which the examinee is being measured.

6. Part 2/Criterion -

a. Scoring is done as described in 3a and b above.

b. No criterion is prescribed for Part 2, but a
45 raw-score cut-off (equivalent to an S-2 Level) is
established for the entire speaking test.


TARGET LANGUAGE CRITERION-REFERENCED TEST
FINAL EXAMINATION SPECIFICATIONS
December 1976

Speaking

The speaking test is divided into two parts: Part 1, in a
direct conversation/interview format, and Part 2, in a
role-playing format.

1. Part 1/Stimulus and Task - Given not less than 15
oral questions sequenced into an informal conversation,
covering at least 3 separate Basic Topics from those listed
in the Examiner's Guide, and presented orally by the interviewer,
the examinee will answer the questions orally, as completely
and fluently as possible.

2. Part 1/Conditions -

a. The conversation/interview will utilize not more
than 10 minutes of the 30 minutes allocated to the speaking
test.

b. No lexical aids are permitted.

c. Vocabulary and grammatical features used in the
stimulus must be limited to those covered in the course of
instruction for which the examinee is being measured.

3. Part 1/Criterion -

a. Scoring is accomplished by the interviewer by
keeping mental notes or casually noting on the Student Rating
Sheet the level of ability demonstrated by the examinee on
each sub-skill.

b. After both Part 1 and Part 2 have been completed,
the examiner combines his/her observations into one grade for
each ability and computes the raw score using the S-3
COMPUTATION TABLE. (The computation table and scoring
procedures are provided in the Examiner's Guide.)

c. No criterion is prescribed for Part 1, but a
63 raw-score cut-off (equivalent to an S-3 Level) is established
for the entire speaking test.

4. Part 2/Stimulus and Task - Given not less than four
role-playing scenarios selected as recommended in the Examiner's
Guide, the examinee will assume the roles indicated in the
scenarios and conduct them with the instructor as naturally
and fluently as possible.



5. Part 2/Conditions -

a. The four scenarios must be completed within 20
minutes.

b. The examinee is permitted to quickly read the
instructions for the scenario, but the use of lexical aids
is not permitted.

c. Vocabulary and grammatical features used in the
stimulus must be limited to those covered in the course of
instruction for which the examinee is being measured.

6. Part 2/Criterion -

a. Scoring is done as described in 3a and b above.

b. No criterion is prescribed for Part 2, but a
63 raw-score cut-off (equivalent to an S-3 Level) is established
for the entire speaking test.
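
The three speaking specifications above differ only in a handful of numbers, so they can be compared side by side. The following is a minimal sketch, not part of the original specifications: the class name `SpeakingSpec`, the dictionary `SPECS`, and the function `meets_cutoff` are hypothetical, and treating a score equal to the cut-off as passing is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakingSpec:
    total_minutes: int          # time allocated to the whole speaking test
    interview_max_minutes: int  # Part 1 (conversation/interview) limit
    min_questions: int          # Part 1 minimum number of oral questions
    min_scenarios: int          # Part 2 minimum number of role-playing scenarios
    roleplay_max_minutes: int   # Part 2 limit
    cutoff: int                 # raw-score cut-off for the entire speaking test
    level: str                  # level the cut-off is stated to be equivalent to


# Values transcribed from the three December 1976 specifications.
SPECS = {
    "ANCHOR CRT I": SpeakingSpec(15, 5, 15, 3, 10, 30, "S-1"),
    "ANCHOR CRT II": SpeakingSpec(20, 5, 15, 4, 15, 45, "S-2"),
    "FINAL EXAMINATION": SpeakingSpec(30, 10, 15, 4, 20, 63, "S-3"),
}


def meets_cutoff(test_name: str, raw_score: int) -> bool:
    """Return True if the raw score reaches the cut-off for the named test.

    Assumption: a raw score equal to the cut-off counts as meeting it.
    """
    return raw_score >= SPECS[test_name].cutoff
```

In each specification the Part 1 and Part 2 time limits add up to the time allocated to the whole speaking test, which the sketch makes easy to check.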



Appendix D
Spanish MCRT Net Administration Time

                              MCRT Administration Time (in minutes)

Skill Measured               ANCHOR      ANCHOR      FINAL
                             CRT #1      CRT #2      EXAM       TOTAL

Listening Comprehension        25          25          30          80
Reading Comprehension          15          20          50          85
Translation                    15          30          45          90
Speaking*                      15          20          30          65
Writing                        20          45          60         125
Number Transcription           15          15          15          45
General Transcription          60          90         135         285

TOTAL                         165         245         365         775

* With the exception of the Speaking Test, knowledge of the
foreign language is not required for MCRT administration.
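
The totals in the Appendix D table can be cross-checked by simple addition. The sketch below is illustrative only; the names `TIMES`, `row_totals`, `col_totals`, and `grand_total` are hypothetical. The row total for General Transcription is illegible in the scanned table; the value 285 is what the three legible entries for that row sum to, and it is also the value needed for the row totals to add up to the printed grand total of 775.

```python
# Administration time in minutes: (Anchor CRT #1, Anchor CRT #2, Final Exam).
TIMES = {
    "Listening Comprehension": (25, 25, 30),
    "Reading Comprehension": (15, 20, 50),
    "Translation": (15, 30, 45),
    "Speaking": (15, 20, 30),
    "Writing": (20, 45, 60),
    "Number Transcription": (15, 15, 15),
    "General Transcription": (60, 90, 135),
}

# Total time per skill across the three tests.
row_totals = {skill: sum(minutes) for skill, minutes in TIMES.items()}

# Total time per test across all skills.
col_totals = tuple(sum(column) for column in zip(*TIMES.values()))

# Overall administration time for the whole MCRT series.
grand_total = sum(col_totals)
```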



Appendix E

SPANISH SPEAKING ANCHOR CRT #2
STUDENT RATING SHEET

NAME ____________________          DATE ____________
SSN  ____________________          CLASS NO. ________

S-2 COMPUTATION TABLE

                         RATING LEVEL:
LINGUISTIC
CATEGORIES:              1       2       3       SKILL POINTS

Pronunciation            2       3       4       ________
Vocabulary               8      10      12       ________
Grammar                 12      14      16       ________
Fluency                  4       5       6       ________
Comprehension           10      12      14       ________

                                         SCORE = ________

CONVERSION TABLE 2-A
SCORE TO LEVEL

SCORE            =  LEVEL
36 (Minimum Score)  =  1
37 - 44             =  1+
45 - 52             =  2

CONVERSION TABLE 2-B
SCORE TO PERCENTAGE

SCORE    % SCORE        SCORE    % SCORE
  52       100            44        69
  51        98            43        65
  50        94            42        61
  49        90            41        57
  48        85            40        53
  47        80            39        49
  46        75            38        45
  45        70            37        41
                          36        37

SKILL LEVEL ________     PERCENTAGE SCORE ________

PASS ____     FAIL ____

EXAMINER ____________________

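
The scoring arithmetic implied by the Student Rating Sheet can be sketched as follows. This is an illustration under stated assumptions, not the procedure from the Examiner's Guide: the names `S2_POINTS`, `PERCENT`, `raw_score`, `skill_level`, and `summarize` are hypothetical; the rating-level headings 1, 2, 3 are inferred from the rating scale described in Appendix F; the percentage for a score of 47 is illegible in the scan and is taken as 80 from the surrounding pattern; and the pass mark of 45 (70%) comes from the Anchor CRT II specification.

```python
# Points per category for rating levels 1, 2, 3 (S-2 COMPUTATION TABLE).
S2_POINTS = {
    "Pronunciation": (2, 3, 4),
    "Vocabulary": (8, 10, 12),
    "Grammar": (12, 14, 16),
    "Fluency": (4, 5, 6),
    "Comprehension": (10, 12, 14),
}

# CONVERSION TABLE 2-B (score -> percentage). The entry for 47 is an assumption.
PERCENT = {
    52: 100, 51: 98, 50: 94, 49: 90, 48: 85, 47: 80, 46: 75, 45: 70,
    44: 69, 43: 65, 42: 61, 41: 57, 40: 53, 39: 49, 38: 45, 37: 41, 36: 37,
}

PASS_CUTOFF = 45  # raw-score cut-off for Anchor CRT II


def raw_score(ratings):
    """Sum the points for a dict mapping each category to a rating of 1, 2, or 3."""
    return sum(S2_POINTS[cat][ratings[cat] - 1] for cat in S2_POINTS)


def skill_level(score):
    """CONVERSION TABLE 2-A: 36 -> 1, 37-44 -> 1+, 45-52 -> 2."""
    if score >= 45:
        return "2"
    if score >= 37:
        return "1+"
    return "1"


def summarize(ratings):
    """Return (raw score, skill level, percentage score, passed?)."""
    score = raw_score(ratings)
    return score, skill_level(score), PERCENT[score], score >= PASS_CUTOFF
```

For example, an examinee rated 2 in every category scores 3 + 10 + 14 + 5 + 12 = 44 points, which falls one point short of the cut-off.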

Appendix F

ADDENDUM TO EXAMINEE'S GUIDE

SPANISH MAJOR CRITERION-REFERENCED ANCHOR TEST #2

SPEAKING

SPANISH BASIC COURSE
(Modules 4-6)

Performance Criteria

1. The Anchor #2 Speaking Test is designed to permit an accurate
assessment of your competency in Spanish when you have
completed Modules 4-6 of the Spanish Basic Course. The information
provided here is to be used with the instructions provided in
the Examinee's Guide for the Anchor #2 Speaking Test.

2. Your examiner will be a native speaker of Spanish who has
been trained in the face-to-face oral interview technique.
The examiner will base his/her judgment of your performance
upon the linguistic quality of what you say during the
interview.

3. Five linguistic categories have been identified as important
in the communication process. The descriptive criteria
which the examiner will use to rate your performance on the
speaking test are presented on the following page. Each category
has been subdivided into three parts and assigned a rating scale
of 1, 2, or 3. The rating scale for each category will determine
the number of points you will receive on the test.

4. Certain linguistic categories are deemed to be of greater
importance than others for speaking. Therefore, different
point values have been assigned which reflect the relative priority
of each category. As you can see from the Student Rating
Sheet (in the Examinee's Guide), the priorities
are, in descending order of importance:

Grammar
Comprehension
Vocabulary
Fluency
Pronunciation



Speaking Rating Scale for Spanish Anchor CRT #2

Category        Rating   Criteria

Pronunciation     3      An obvious foreign accent with occasional
                         mispronunciations that cause misunderstanding.
                  2      A marked foreign accent which requires concentrated
                         listening, and mispronunciations which lead to
                         frequent misunderstanding.
                  1      Frequent errors and a very heavy accent make
                         understanding difficult; requires frequent
                         repetition.

Vocabulary        3      General vocabulary permits discussion of most topics
                         listed, with some paraphrasing and circumlocutions.
                  2      Choice of words frequently inaccurate; limitations
                         of vocabulary prevent adequate discussion of some
                         topics and situations.
                  1      Vocabulary limited to a very basic level on the
                         topics covered in the interview.

Grammar           3      Occasional errors, showing imperfect control of some
                         major patterns, but seldom causing misunderstanding.
                  2      Frequent errors, showing some major patterns
                         uncontrolled and causing occasional irritation and
                         misunderstanding.
                  1      Constant errors, showing control of few major
                         patterns and causing occasional irritation and
                         misunderstanding.

Fluency           3      Speech is occasionally hesitant, with some unevenness
                         caused by rephrasing and groping for words.
                  2      Speech is frequently hesitant and jerky; sentences
                         may be left uncompleted.
                  1      Speech is very slow and uneven, except for routine
                         phrases and social expressions.

Comprehension     3      Understands normal educated speech quite well, but
                         requires occasional repetition or rephrasing.
                  2      Understands careful, somewhat simplified speech, with
                         considerable repetition and rephrasing.
                  1      Understands only slow, simple speech; requires
                         frequent repetition and rephrasing.


References

Defense Language Institute. Administration and Scoring Manual for
     Foreign Language Oral Production (FSI Interview). Presidio of
     Monterey, Calif., 1965.

______. Spanish Basic Course. Instructional Guide and Modules
     1-9, Tests and Workbooks. Presidio of Monterey, Calif., 1975.

______. Spanish Basic Course. Major Criterion-Referenced Tests
     (Examiner's and Examinee's Guides and Administration and Scoring
     Manual). Presidio of Monterey, Calif.: Foreign Language Center, July
     1976, August 1976, October 1976, November 1977.

______. Systems Development Agency (Provisional). Test Development
     Standards. Presidio of Monterey, Calif., July 1974.

Department of the Army. Interservice Procedures for Instructional Systems
     Development. Fort Monroe, Va.: Headquarters, United States Army
     Training and Doctrine Command, August 1975.

Lowe, Pardee, Jr. Handbook on Question Types and Their Use in LLC Oral
     Proficiency Tests (Preliminary Version). Arlington, Va.: Language
     Learning Center, Central Intelligence Agency, May 1976.

Woodford, Protase E. "Testing Guidelines for DLI Tests." Princeton,
     N.J.: Educational Testing Service, 1972.

Additional references as listed in TRADOC Pamphlet 350-30, Interservice
     Procedures for Instructional Systems Development, Executive Summary
     and Model, pp. 132-49. Fort Monroe, Va.: Headquarters, United
     States Army Training and Doctrine Command, August 1, 1974.




ORAL PROFICIENCY TESTING IN NEW JERSEY BILINGUAL AND 
ENGLISH AS A SECOND LANGUAGE TEACHER CERTIFICATION 



Richard W. Brown 
New Jersey State Department of Education 



On January 8, 1975, New Jersey's governor, Brendan T. Byrne, signed
Senate Bill No. 811, also known as the New Jersey Bilingual Law. The law
provided for mandatory bilingual education programs in New Jersey public
schools.

Regulations for use in administering programs in bilingual education
require that teachers of bilingual and English as a second language 
education possess appropriate certification. 

The New Jersey State Board of Education, on October 1, 1975, approved
bilingual/bicultural and English as a second language teacher certifi- 
cation regulations. The State Department of Education's Bureau of Teacher 
Education and Academic Credentials maintains responsibility for monitoring 
the implementation of the regulations. 

Bilingual/bicultural and English as a second language certification
regulations were developed by a statewide committee of experts in
bilingual and English as a second language education. The committee consisted
of public school teachers, college and university staff, Department
of Education staff, Educational Testing Service staff, and members of
statewide bilingual interest groups. Prior to their final approval by the 
State Board of Education, the certification regulations underwent numerous 
revisions after having been reviewed by educational personnel throughout 
the state. The final draft of the regulations also appeared in the New 
Jersey State Register on two occasions. 

English as a second language certification regulations require that
all teachers display "evidence of native or near-native competency in
English as determined by guidelines . . . established by the New Jersey
State Department of Education." To be eligible for standard or
substandard bilingual/bicultural certification, all teachers must provide
"demonstration of verbal and written proficiency in English and in one
other language used also as a medium of instruction."

Prior to the enactment of the certification regulations in 1975,
the State Department of Education sought the assistance of Educational
Testing Service to develop a method and/or device capable of determining
(1) native or near-native competency in English and (2) proficiency in
English and other languages used as media of instruction.

Teachers in bilingual and English as a second language programs are 
expected to possess sufficient language competency to adequately present
subject matter and to conduct classroom activities. 

According to Educational Testing Service staff, heretofore most
measures of second- or foreign-language ability were designed primarily
to assess those skills normally stressed in formal, academic foreign
language programs. These measures were not well suited to determine the
ability of the examinee to function effectively in the other language
milieu. Emphasis in such tests was often on formal grammar, grammatical
terminology, and literary analysis, all areas of questionable need for many
bilingual teachers.

The need, therefore, was for an examination or a procedure that would
measure the ability of the examinee to function effectively in the
classroom through the medium of English (for teachers of English as a second
language education) or English and Spanish (for teachers of bilingual
education). The ability to function effectively would be manifested by
such things as (1) the ability to comprehend completely the "talk" of
children and parents, both English speaking and Spanish speaking; (2) the
ability to communicate in English and Spanish with children and parents on
school-related and other topics; and (3) the ability to present subject
matter in the classroom, carry on classroom discussion, ask and answer
questions, and explain concepts in both English and Spanish.

An issue of importance equal to that of the measurement of language
proficiency is the determination of minimum competency. That a bilingual
teacher must be "fluent" in English and Spanish seems a reasonable
qualification, but what does "fluent" mean? What level of language performance
should be the requisite minimum for teachers to carry out their duties in
bilingual classrooms?

The instrument and procedures developed by Educational Testing
Service addressed two broad issues: (1) the evaluation of oral
proficiency in English and Spanish and (2) the establishment of criteria for
determining minimal competency in English and Spanish.

The system developed for the New Jersey State Department of Education
by Educational Testing Service for the purpose of determining oral
language proficiency in English and Spanish is known as the Language
Proficiency Program.

The program utilizes the Language Proficiency Interview (LPI), which
was developed by linguists at the Foreign Service Institute. The Foreign 
Service Institute provides foreign language training to and certifies the 
foreign language abilities of U.S. Department of State and other federal 
government personnel. 

Among the reasons for the development of the Language Proficiency
Interview procedure was the absence of a reliable, direct measure of
communicative competence (listening comprehension and speaking skills)
that would be appropriate to assess skills from the level of no ability to
the level of proficiency equivalent to that of an educated native speaker.

The Language Proficiency Interview has been in use for over fifteen
years. Among the federal agencies using the LPI and the accompanying
scale are the Department of State, Department of Defense, and ACTION/Peace
Corps.






The interview procedure as carried out by the Foreign Service
Institute, the Peace Corps, and others is as follows:

The interviewee, the interviewer, and a rater/linguist meet for up to
thirty minutes. During this period the interviewer carries out what
appears to be a friendly, informal conversation with the examinee. The
rater/linguist may join in the conversation when and if appropriate. The
interviewer conducts the conversation in such a way that a relatively
complete sample of the examinee's abilities in the target language is
obtained. Typically, the interview begins at a relatively simple level
and becomes progressively more complex. The vocabulary, structure, and
comprehension required to continue the conversation become increasingly
difficult. When the interviewer and rater/linguist are confident the
examinee has performed at the highest level of which he or she is capable,
the interview is concluded.

The length of the interview is usually in direct proportion to the 
ability of the examinee--i.e., the lower the level, the shorter the
interview; the higher the level, the longer the interview. The normal 
extremes are ten and thirty minutes. 

Although it is common for the interviewer and rater to confer and 
agree on a rating, the responsibility for the official rating rests 
with the rater/linguist. 

In addition to the conversation per se, one or more activities 
designed to elicit further evidence of the examinee's ability may be 
undertaken, such as a series of direct translations or a "real-life" 
situation in which the examinee serves as interpreter between a
"monolingual English" and a "monolingual Spanish" speaker.

All applicants for New Jersey bilingual/bicultural and English
as a second language certification must complete Language Proficiency
Interviews. An applicant seeking bilingual/bicultural certification
must complete Language Proficiency Interviews in English and the other
language he or she will use in the public school classroom as the medium
of instruction. An English as a second language certification applicant
must complete an LPI in English.

In New Jersey, Language Proficiency Interviews may be completed at
any one of seven centers established by the State Department of Education
with the assistance of Educational Testing Service. The centers are
located at Glassboro State College, Jersey City State College, Kean
College of New Jersey, Monmouth College, Rutgers Graduate School of
Education, Trenton State College, and William Paterson College of New
Jersey.

The State Department of Education utilized two principal criteria
when determining sites for centers: each had to be (1) an institution of
higher learning offering a bilingual and/or English as a second language
teacher education program and (2) located near public school districts
containing large populations of bilingual students and teachers.






Interviewers for the centers were identified, screened, and selected
for training by the State Department of Education with the assistance of
Educational Testing Service. The trainees were language specialists
from New Jersey public schools and institutions of higher learning. All
trainees participated in training sessions conducted by Educational
Testing Service. Upon completion of the sessions, the participants were
certified as official language proficiency interviewers if they met
all qualifications identified by Educational Testing Service, including
the ability to reach an oral language proficiency level of 4 in the
languages in which they were trained to interview.

As of March 1, 1978, applicants for English as a second language 
certification must reach a proficiency level of 4 in English to be
eligible for standard certification. A level of 3 in English and 4 in the
other language used as the medium of instruction are required for standard 
bilingual/bicultural certification. 

To date, more than 1,400 Language Proficiency Interviews required for 
New Jersey bilingual/bicultural and English as a second language teacher 
certification have been completed.

During the past two years I have been asked, on a number of 
occasions, what I consider to be the strengths of the New Jersey program, 
and what recommendations I would give to any state planning to develop 
certification in these areas. 

I will first list what I consider to be the strengths of our program: 

1. Certification regulations were developed by a statewide committee 
of experts in bilingual and English as a second language education, 
including a representative of the state education association. 

2. Certification regulations require language proficiency for 
both certificates. 

3. Educational Testing Service has been assisting New Jersey from
the beginning in the development of the certification regulations and the
language proficiency interview system.

4. Oral language proficiency for teachers is determined by use of
the Foreign Service Institute language proficiency interview and scale.

5. Language proficiency interviews are given in a number of regional 
centers strategically located throughout the state so as to provide 
teachers easy access to centers for interviews. 

6. Interviewers are trained by Educational Testing Service. 

7. The high levels of proficiency required for certification assure 
greater opportunities for successful communication between teachers
and students in the classroom. 






8. The comprehensive certification regulations guarantee that all
teachers possess appropriate background needed to be more effective in the
classroom. Regulations for both certificates contain extensive cultural
components. The English as a second language regulations provide for
comprehensive study in linguistics.

9. The results of recent litigation regarding the certification
regulations have strengthened the overall program. Federal and state
courts have determined that the regulations are legal, fair, and
nondiscriminatory.

Second, I will identify some suggestions I would give to states 
planning to develop certification regulations for bilingual and English 
as a second language teachers: 

1. Provide for funding at the state level to support the
implementation of bilingual legislation.

2. Communicate with state legislators during the developmental 
stages of legislation. 

3. Involve representatives of all statewide interest groups, 
including public school teachers and administrators, when developing 
regulations.

4. Require oral language proficiency in English for teachers of 
English as a second language, and in English and the other language being 
used as the medium of instruction in the classroom for bilingual teachers. 

5. Utilize the Foreign Service Institute language proficiency
interview system.

6. Request the assistance of Educational Testing Service when 
developing an interviewing system. 

7. If possible, pretest the language proficiency system chosen for 
state use prior to implementing such a program. This should include 
conducting validity and reliability studies. 

8. Require that tapes of interviewees be rated by more than one
rater.

9. Contact other states that have instituted regulations to request
information regarding their developmental and implementation procedures.

10. Develop regional interview centers within the state, as has been
done in New Jersey.

11. Train prospective interviewers who have appropriate bilingual
and/or English as a second language educational experience.



12. Work closely with institutions of higher learning that wish to
develop teacher training programs. 

13. Consider all areas previously identified as strengths of the 
New Jersey program. 

14. Provide discussion sessions throughout the state for teachers who 
will be affected by regulations. At that time, explain all ramifications 
of the implementation of the regulations, including the language
proficiency interviewing and rating systems.

15. Educate the public. Provide information to parents of children 
who will be affected by the regulations, either through workshops or 
with printed materials. 

16. Provide opportunities for teachers who possess teaching 
experience in bilingual and/or English as a second language classrooms to 
be given credit for such experience. The credit should be applicable 
toward standard certification. 

17. Provide all parties concerned sufficient time to fulfill all 
rules and regulations related to bilingual and English as a second 
language certification. 






References

An Approach to the Assessment of English and Spanish Oral Proficiency of
     Bilingual/Bicultural Teachers and Teacher Candidates. Princeton,
     N.J.: Educational Testing Service, June 1976.

Language Proficiency Program, Bulletin of Information. Princeton,
     N.J.: Educational Testing Service, 1976.

New Jersey State Department of Education, Bureau of Bilingual Education.
     New Jersey Bilingual Law. Trenton, January 8, 1975.

______. Regulations for Use in Administering Programs in Bilingual
     Education as Provided for in Chapter 197 of the New Jersey Laws of
     1974. Trenton, 1975.

New Jersey State Department of Education, Bureau of Teacher Education and
     Academic Credentials. New Jersey Bilingual/Bicultural Teacher
     Certification Regulations. Trenton, October 1975.

______. New Jersey English as a Second Language Teacher
     Certification Regulations. Trenton, October 1975.






ADAPTATION OF THE FSI INTERVIEW SCALE 
FOR SECONDARY SCHOOLS AND COLLEGES 



Claus Reschke 
University of Houston 




The prototype for the direct oral interview proficiency tests
currently in use by U.S. government agencies and in a few schools and
colleges is the interview test developed in 1956 by the staff of the
Foreign Service Institute (FSI) of the U.S. Department of State. Although
this test has undergone several changes and refinements during the past
twenty-two years, its original format is still basically intact. This is
because the test has, over the years, repeatedly proven itself to be a
highly face-valid, extremely reliable and--for the specific needs of the
FSI--very practical vehicle with which to determine the oral proficiency
of career diplomats and other foreign service personnel whose jobs require
foreign language proficiency.1

Because this particular test meets so well the basic criteria of
reliability and practicality, if not also the criterion of validity, an
increasing number of educators teaching foreign languages in high schools
and colleges are considering using it to determine the oral proficiency of
their students at various points during their language study. High school
teachers could use the test to measure the oral proficiency of their
students after two, three, or four years of language study. In college
the test could have several uses. It could, of course, measure the oral
proficiency of students after two, three, or four semesters of language
study. It could also serve as part of a diagnostic and qualifying
examination in undergraduate foreign language education programs, to assure
that only those students who have reached at least an oral proficiency
level of 2 are allowed to start the student-teaching phase of their
programs.2 At the graduate level, the test could be used as part of a
qualifying examination for admission to graduate programs and for awarding
teaching fellowships in foreign language departments.



1 Those unfamiliar with the FSI test can find a detailed description of
it in the article "The Oral Interview" by Claudia P. Wilds, one of the
originators of the test, in Testing Language Proficiency, edited by
Randall L. Jones and Bernard Spolsky (Arlington, Va.: Center for
Applied Linguistics, 1975), pp. 29-44.

2 A very elaborate interview system is being used by Purdue University
in its teacher education program. There, each undergraduate major
in teacher education must complete two interview sessions with a
three-person testing team, consisting of the coordinator for foreign
languages and literatures education, a methodologist in the target






One of the prime reasons why this test is of such interest to
teachers who wish to assess the oral proficiency of their students
is the test's high reliability. A cross-language reliability study,
conducted by the FSI in 1973, included French, German, and Spanish tests,
and yielded a reliability coefficient of .85. Other in-house reliability
studies conducted by the FSI, which were limited to only one language,
have produced similar results, with one study, based on French tests
given, showing a reliability coefficient of .93.3

Another reason why this particular oral proficiency test is of great
interest to high school and college teachers is the thorough evaluation
criteria set up for it by the FSI. Table 1 shows that the FSI evaluates
a candidate's interview performance in five categories: accent
(pronunciation and intonation), grammar (morphology and syntax), vocabulary,
fluency, and comprehension. A weighted point system has been developed by
the FSI, with the weights distributed as follows: accent 0, grammar 3,
vocabulary 2, fluency 1, and comprehension 2. Thus grammar, vocabulary,
and comprehension are considered by the FSI to be the most important
elements of oral proficiency, a view most language teachers would be able
to support on the basis of their own experience. The FSI's weighted
scoring system (Table 2) was derived from multiple-correlation studies
using the level ratings that had been assigned to numerous examinees.4



language, and an instructor in the target language. The first interview
is diagnostic in nature; the second one, given at the completion of an 
advanced conversation course in the target language, seeks to determine 
if the student meets predetermined minimal oral proficiency standards 
before he or she is given permission to start the semester of student 
teaching. 

At the University of Houston, an interview test, conducted by three 
faculty members, is used only in the German teacher education program. 
It is part of a comprehensive examination on language, culture, and 
literature that every German teacher education major must pass before 
starting the semester of student teaching. 

3For the results of a more recent reliability study of FSI test scores,
see Marianne L. Adams's paper in this volume: "Measuring Foreign Language
Speaking Proficiency: A Study of Agreement among Raters."



4Wilds, p. 32.






TABLE 1
FSI Speaking Evaluation

                                    1    2    3    4    5    6
1. Accent          foreign          4    3    2    2    1    0    native
2. Grammar         inaccurate       6   12   18   24   30   36    accurate
3. Vocabulary      inadequate       4    8   12   16   20   24    adequate
4. Fluency         uneven           2    4    6    8   10   12    even
5. Comprehension   incomplete       4    8   12   15   19   23    complete


TABLE 2
FSI Level Assignment

FSI Score    FSI Rating
 0-15        S-0
16-25        S-0+
26-32        S-1
33-42        S-1+
43-52        S-2
53-62        S-2+
63-72        S-3
73-82        S-3+
83-92        S-4
93-99        S-4+
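
The scoring procedure summarized in Tables 1 and 2 can be sketched in a few lines of Python. This is only an illustration, not FSI software: the names POINTS, LEVELS, fsi_score, and fsi_level are invented here, the accent row is entered as it is printed in Table 1, and the first grammar value is assumed to be 6, in keeping with that row's steps of 6.

```python
# Weighted points from Table 1, indexed by the 1-6 rating in each category.
# All names are illustrative; the first grammar value is assumed to be 6.
POINTS = {
    "accent":        [4, 3, 2, 2, 1, 0],
    "grammar":       [6, 12, 18, 24, 30, 36],
    "vocabulary":    [4, 8, 12, 16, 20, 24],
    "fluency":       [2, 4, 6, 8, 10, 12],
    "comprehension": [4, 8, 12, 15, 19, 23],
}

# Score bands from Table 2: (lowest score, highest score, rating).
LEVELS = [
    (0, 15, "S-0"), (16, 25, "S-0+"), (26, 32, "S-1"), (33, 42, "S-1+"),
    (43, 52, "S-2"), (53, 62, "S-2+"), (63, 72, "S-3"), (73, 82, "S-3+"),
    (83, 92, "S-4"), (93, 99, "S-4+"),
]


def fsi_score(ratings):
    """Sum the weighted points for a dict mapping category -> rating (1-6)."""
    total = 0
    for category, row in POINTS.items():
        rating = ratings[category]
        if not 1 <= rating <= 6:
            raise ValueError(f"{category} rating must be 1-6, got {rating}")
        total += row[rating - 1]
    return total


def fsi_level(score):
    """Look up the rating whose Table 2 band contains the score."""
    for low, high, rating in LEVELS:
        if low <= score <= high:
            return rating
    raise ValueError(f"score {score} is outside the bands of Table 2")


# A candidate rated 3 in every category.
ratings = {name: 3 for name in POINTS}
score = fsi_score(ratings)
print(score, fsi_level(score))
```

For a candidate rated 3 in every category, the points are 2 + 18 + 12 + 6 + 12 = 50, which falls in the 43-52 band of Table 2.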






However, there are two major reasons why the FSI interview test, in
its present form, is not really suitable for use in high school and 
college. 

First, the test's administration, which has proven to be very 
practical for the FSI, would be much less practical for schools and 
colleges. As it stands, two testers are required for each testing 
session, 5 one a native speaker of the target language and the other a
certified language examiner, who may be either a native speaker and
instructor of the target language or a linguist thoroughly familiar with
the language. 6 Past experience of the FSI, CIA, and Peace Corps has
shown that an examination team is able to conduct about fifteen interviews
per day. 7 Since schools and colleges must test hundreds of students
at the end of a term or a semester, however, the man-hours involved would
be almost prohibitive. In addition, administering the test costs an
estimated $40 per examinee, 8 a figure that, when multiplied by hundreds
of students, would also be prohibitive. 

The second major problem with using the FSI test in high school and
college lies in the absolute oral proficiency rating scale used by
the FSI and other government agencies. Ranging from 0 to 5 (that is,
from almost no speaking ability to a thoroughly bilingual fluency,
with a "plus" level above each primary level 9), the scale is far too
broad in scope to be meaningful for use when testing the limited oral
proficiency found in high schools and colleges. John Carroll's well- 
documented study of 1967, which sought to determine the foreign language 
proficiency of college language majors, revealed that few of them ever 



5Of the five government agencies administering the interview test (FSI,
DLI, NSA, CIA, and CSC), only the Defense Language Institute uses, due to
limited resources, one tester. See Pardee Lowe, Jr., The Oral Language
Proficiency Test (Washington, D.C.: Interagency Language Round Table,
1976), p. 2.

6See Wilds, p. 30. Before a language examiner can be certified, he or 
she must have reached at least the oral proficiency level 4 in the target 
language. 

7John L. D. Clark, "Theoretical and Technical Considerations in Oral
Proficiency Testing," Testing Language Proficiency, p. 16.

8This figure is based on information supplied for the year 1977 by
the Testing Committee of the Interagency Language Round Table, U.S.
Government.

9A "plus" designation indicates that a candidate has reached a proficiency
that substantially exceeds the minimum requirements for a given
level but does not meet all the minimum requirements for the next higher 
level. See Wilds, p. 36. 






reached the 2+ level on the FSI scale during their senior year, whether
they were studying French, German, Russian, or Spanish. 10 I believe
this situation has not changed much in the past ten years. Therefore,
most of the students tested in high school and college would fall into
only three FSI categories, 1, 1+, and 2, making it difficult to show
differences among them or to indicate their progress over a period of one
or two semesters.

It appears, therefore, that before the FSI interview test can be used
effectively in high schools and colleges some major modifications are 
necessary. 



Suggested Modifications to Interview Procedure and Scale

I believe that the excessively high time and cost factors related to
the administration of the test could be reduced without much loss in the
reliability of the test results. The method I suggest is to reduce the
testing team from two to one and to increase the number of students tested
from one at a time to three, four, or even five. I believe the test would
then be practical and would also remain a reliable instrument, so long as
care were taken that all students being tested at the same time were
at about the same level of proficiency.

The second problem with using the FSI test in high school or
college (the broad absolute proficiency rating scale) is more complex but
also has a solution. The solution I propose is to modify the FSI rating
scale. Let us add to the six whole numbers and the five "plus" levels
used by the FSI a second series of numbers that will refine the examinee's
score and make it more meaningful. Each FSI number can be followed by a
decimal point, and then by one or more additional "fine-tune" or
performance-interpretive numbers.

I see this proposal as a combination of two scales, one vertical and
one horizontal. The FSI ratings fall on a vertical scale:

0+
1
1+
2
2+
etc.



10John B. Carroll, Foreign Language Attainments of Language Majors in
the Senior Year: A Survey Conducted in U.S. Colleges and Universities
(Cambridge, Mass.: Harvard University, 1967), pp. 10 ff., 4G ff.; John
B. Carroll, "Foreign Language Proficiency Levels Attained by Language
Majors Near Graduation from College," Foreign Language Annals 1, No. 2
(1967), pp. 131-51.






To this scale I would add a scale of horizontal numbers at each of the
vertical scale levels, designed to provide as much precise data about a 
student's linguistic performance as a teacher might want. 

For example, two students' oral proficiency may lie somewhere between
the FSI ratings of 0+ and 1. Which of the two students is more proficient?
The horizontal scale might indicate that the first one has a fine-tune
score of 3 and the second a score of 7. The total ratings for these
students could then be written as 0+.3 and 0+.7, visually awkward ratings
to which I shall return shortly. The second student has, in any case,
been shown to be more proficient on the basis of the combined vertical
and horizontal scales.

Theoretically, it would be possible to add an infinite number of 
digits to the horizontal scale. For example, the fine-tune digits 3 and 7 
in the above example could be followed by five other digits indicating, on
a scale of 0 to 9, the strength of the student's performance in each
of the five evaluated categories (accent, grammar, vocabulary, fluency,
and comprehension). Six additional digits might represent diagnostic
ratings, with the first digit again a composite rating, on a scale of 0
to 9, followed by the five digits representing individual ratings in the
five evaluated categories. These digits could, for example, provide
information in the areas of phonology and syntax that would show whether
a student has started to internalize a faulty phonological or grammatical
system, and to what extent. Another group of six digits, the first one
again a composite of the following five, could represent a specific
projection of the degree of success that might be expected from future
language training in each of the five evaluated categories.

The possibilities for use of the horizontal scale seem endless.
However, the value of expanding it beyond the composite rating for each
of the three proposed major areas (fine-tune, diagnostic, and projection)
is questionable, since detailed ratings in only these three areas would
result in an overall rating nineteen digits long. This would be an
extremely awkward number to read and interpret. Retaining only the
composite rating digit for each area, on the other hand, would yield a
total rating for each test performance of only four digits. This number
would certainly provide both student and teacher with far more information
about the student's linguistic performance on the test than the
single-digit FSI level assignment yields.
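
The digit counts in this paragraph can be checked with a small Python sketch. The class and field names are hypothetical, the sample digits are invented, and the layout (a single level digit, then for each area a composite digit followed by five category digits) is one reading of the proposal rather than a fixed format. Plus levels are left out here for simplicity.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Area:
    """One horizontal-scale area: a composite digit plus five category digits."""
    composite: int                               # overall digit, 0-9
    categories: Tuple[int, int, int, int, int]   # accent, grammar, vocabulary,
                                                 # fluency, comprehension

    def digits(self) -> str:
        return str(self.composite) + "".join(str(d) for d in self.categories)


@dataclass
class Rating:
    level: int          # FSI level as a single digit (plus levels omitted)
    fine_tune: Area
    diagnostic: Area
    projection: Area

    def areas(self):
        return (self.fine_tune, self.diagnostic, self.projection)

    def full(self) -> str:
        """Level digit followed by all eighteen horizontal digits."""
        return f"{self.level}." + "".join(a.digits() for a in self.areas())

    def compact(self) -> str:
        """Level digit followed by the three composite digits only."""
        return f"{self.level}." + "".join(str(a.composite) for a in self.areas())


# Invented sample values, for illustration only.
rating = Rating(1,
                Area(7, (6, 7, 8, 7, 6)),
                Area(4, (5, 4, 4, 3, 4)),
                Area(6, (6, 6, 5, 6, 7)))
print(rating.full(), len(rating.full().replace(".", "")))
print(rating.compact(), len(rating.compact().replace(".", "")))
```

The full form carries one level digit plus three groups of six, nineteen digits in all; the compact form keeps the level digit and the three composites, four digits in all.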

Of course, narrative descriptions would have to be written for each 
point on the horizontal scale at each of the eleven proficiency levels.
The task seems enormous. It could be simplified, however, if only three
narrative descriptions were written for each of the three areas (fine-tune,
diagnostic, projection) proposed for the horizontal scale. Each
area would then have a narrative description for the subranges 0-3, 4-6,
and 7-9. Furthermore, since high school and college students would seldom
exceed the 2+ level on the FSI absolute oral proficiency rating scale, why
not limit the narrative descriptions for the horizontal scale to the 0+ to
2+ range on the vertical scale?







I recommend that the FSI rating scale be modified only in these ways, 
however, and not in others. I would retain the weighted scoring system 
used by the FSI and the present level assignment system, where the level
is determined by the number of points achieved by the examinee in each of
the five categories in which his performance is being rated (see Tables 1
and 2). 11 Both have proven over the past twenty years to be highly
reliable measures of oral proficiency. I would suggest, however, that
all eleven points on the FSI absolute oral proficiency rating scale be
converted into two-digit numbers to facilitate recording of the test
results. Thus level 1 would be recorded as level 10, level 1+ as 15,
and level 0+ as 05. 12 This procedure would keep intact the narrative
descriptions developed by the FSI for each general proficiency level and
allow us to continue to indicate a strong test performance that warrants
a plus rating without having an awkward plus sign preceding the decimal
point. Also, the chance of an administrative error occurring in the
recording of the student's rating on his permanent school record would be
substantially reduced by changing the plus sign to a number, an aspect not
to be treated lightly in this period of increased reliance on computerized 
record-keeping systems in high schools and colleges. 
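
Since the argument here turns on computerized record keeping, the recording convention can be sketched in Python. The function names are hypothetical. The mapping of the plus sign to the digit 5 follows the text, and the single fine-tune digit after the decimal point and the subranges 0-3, 4-6, and 7-9 follow the proposal made earlier in this paper.

```python
def two_digit_level(fsi_level):
    """Convert an FSI level such as "0+", "1", or "1+" to "05", "10", "15"."""
    plus = fsi_level.endswith("+")
    base = fsi_level[:-1] if plus else fsi_level
    if base not in {"0", "1", "2", "3", "4", "5"} or (plus and base == "5"):
        raise ValueError(f"not one of the eleven FSI levels: {fsi_level!r}")
    return base + ("5" if plus else "0")


def subrange(fine_tune):
    """Return the narrative-description subrange for a fine-tune digit."""
    if not 0 <= fine_tune <= 9:
        raise ValueError("fine-tune digit must be 0-9")
    if fine_tune <= 3:
        return "0-3"
    if fine_tune <= 6:
        return "4-6"
    return "7-9"


def record(fsi_level, fine_tune):
    """Combine the two-digit level and the fine-tune digit, e.g. "05.7"."""
    subrange(fine_tune)  # validates the digit
    return f"{two_digit_level(fsi_level)}.{fine_tune}"


# The two students discussed earlier, both between 0+ and 1.
print(record("0+", 3), record("0+", 7))
```

The awkward ratings 0+.3 and 0+.7 from the earlier example become 05.3 and 05.7, which sort and compare as ordinary numbers.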

Example of Expanded Diagnostic Scale

So far I have discussed the possibilities of adapting a few administrative
procedures and the rating scale of the FSI interview test to meet
the realities and needs of high school and college teachers. I would
like to concentrate on only one of the three areas on the proposed
horizontal scale, the one that involves the first digit after the decimal
point. This is the most important of the three digits, because it
contains the most useful information for teacher and student alike: the
progress a student has made during a given period of time, say, one or two
semesters.



11However, I would suggest that the range of points in the first
category on the FSI scale, accent, be reversed, since it makes little
sense to award zero points for a "native" accent and four points for an
obviously "foreign" one. The number of points involved is nominal.

12It may be argued that the conversion of the "+" to the digit "5"
creates a false impression, since the FSI assigns a plus rating only
to a performance that substantially exceeds the minimum requirements for
a given level but does not meet all the minimum requirements for the
next higher level. Use of the digit "5" to indicate a plus rating seems
to imply, however, that the candidate's linguistic performance (on a
scale of 0-9) met half the minimum requirements for the next higher
level, not most of them, as FSI criteria demand. (See Wilds, p. 36.)
The objection is valid, the problem minor. All that is needed is to
substitute for the "5" a "7" or an "8" to convert the "+" to a numeral.






This first composite digit after the decimal point designates
the fine-tune level of an examinee's linguistic performance. For this
first digit on the horizontal scale, I propose the following preliminary
narrative descriptions. They have been written using as a guide
"Descriptions of the FSI Absolute Oral Proficiency Rating Scale" and the
"Detailed Description of the FSI Checklist" 13 developed by the FSI in
1961.



Fine-Tune Level Description
General proficiency level: 05

Range 05.0-05.3: Candidate's pronunciation is nearly unintelligible;
his use of grammar is almost always inaccurate; his vocabulary consists
mostly of isolated high-frequency words that he uses haltingly; his
ability to converse is extremely limited and does not go beyond answering
simple yes/no questions.

Range 05.4-05.6: Candidate's pronunciation is frequently unintelligible;
his use of grammar is often incorrect; his vocabulary is extremely
limited and insufficient to carry on even the most simple conversation;
his speech is halting and consists of individual words and simple phrases;
his conversational skill barely goes beyond the ability to answer simple
yes/no questions.

Range 05.7-05.9: Candidate's pronunciation is occasionally
unintelligible; his use of grammar is frequently incorrect, preventing
communication, but he shows some control over one or two major grammatical
patterns; his vocabulary is quite limited, but he is able to carry on,
though very haltingly, the most simple and fragmentary conversation
about himself and his family (telling time, naming simple after-school
activities, talking about main meals, telling the size of his family, and
so on); he understands only slowly spoken speech and often-repeated simple
statements and questions.



General proficiency level: 10

Range 10.0-10.3: Candidate frequently makes major pronunciation
errors that impede understanding and require him to repeat his utterances;
his rate of grammatical errors is extremely high, but he has some control
over two or three major grammatical patterns, which he employs correctly
with a fair degree of consistency, so that communication, although frequently
hampered, is not entirely impossible; his range of vocabulary is



13Lowe, pp. 29-30.






limited to the basic personal and social level (e.g., time, three or
four food items, two or three beverages, primary means of transportation,
major weekend activities); his speech is slow and uneven; he understands
very simple speech based on high-frequency situations or topics of a 
personal or social nature (e.g., age, simple family relationships, simple 
activities performed around the house, living accommodations at home), but 
requires frequent repetition and rephrasing of questions and statements. 

Range 10.4-10.6: Candidate occasionally makes major pronunciation
errors that interfere with understanding him consistently; his rate of 
grammatical errors is high, but he has good control over two or three 
major grammatical patterns, which he employs correctly with a high degree 
of consistency, allowing him to communicate at a fairly simple level; his 
vocabulary, although still limited to the basic personal and social level 
(e.g., four to ten food items, three to four beverages, simple purchases, 
the departure times of trains, planes, buses, and streetcars), allows him 
to communicate very briefly, simply, and imperfectly on a variety of 
high-frequency topics (e.g., daily meals, ordering two or three simple 
meals in a restaurant, describing in simple terms three to four activities 
at home, describing in simple language a visit to a grocery store, movie, 
theater, or concert, asking for simple directions); his speech is slow and
uneven, except for short, routine sentences and phrases; his understanding 
is slow, although he does understand very simple statements and questions 
about a variety of high-frequency situations he would be expected to 
encounter daily, socially, or as a tourist, even though he may require 
frequent repetition and rephrasing of statements. 

Range 10.7-10.9: Candidate seldom makes major pronunciation errors,
but frequent minor errors hamper understanding; he makes many grammatical
errors but has good control over three or four major grammatical patterns,
which he employs correctly with a moderate degree of consistency, allowing
communication to proceed at a fairly simple level; his vocabulary enables
him to perform a variety of linguistic tasks (e.g., giving simple directions,
asking for lodging, ordering fifteen to twenty-five different items
of food and six different beverages, inquiring about the cost of postage,
purchasing some items of clothing), even though his choice of words is
frequently inaccurate; his speech is hesitant, and his sentences are very
often left incomplete; he understands slow, simplified speech on a variety
of personal, social, and tourist topics, but requires frequent repetition.

General proficiency level: 15

Range 15.0-15.3: Candidate occasionally makes minor pronunciation
errors and has a distinctly foreign accent, which requires highly
concentrated listening and leads occasionally to misunderstandings; his
grammatical errors are of such a nature as to indicate that there are
three or four grammatical patterns over which he has no consistent control
(e.g., auxiliary verbs in perfect tenses, past participles of verbs, word
order), causing occasional irritation and leading frequently to misunderstandings;
he sometimes chooses incorrect words, but his vocabulary is




large enough for him to be able to converse haltingly about routine travel
needs (e.g., changing money, asking for and giving simple directions,
ordering three different major meals, making simple introductions, making
simple telephone calls, planning a trip with a travel agent) and a select
group of topics in the personal and social domain (e.g., family, hometown, 
education, occupation or planned career); he understands quite well 
careful, somewhat simplified speech, but requires occasional repetition 
and rephrasing of statements. 

Range 15.4-15.6: Candidate makes few pronunciation errors but
has a strong foreign accent that requires concentrated listening; his
grammatical errors are consistent enough to be categorized; his range of
vocabulary allows him to talk with confidence about himself and other
people, make introductions, discuss in simple language major events,
describe medical needs to a nurse or pharmacist in simple terms, arrange a
meeting with someone, and communicate to a service station attendant
routine maintenance instructions for his car; his speech is sometimes
jerky, often hesitant; occasionally sentences may be left uncompleted;
however, he understands quite well somewhat below normal-rate speech that
has been slightly simplified for his benefit, although some repetition and
rephrasing of statements is required.

Range 15.7-15.9: Candidate's accent is quite foreign sounding and 
requires some concentrated listening; his pronunciation errors are few and
mostly random; grammatical errors are of two kinds, random and consistent 
(some grammatical patterns are used incorrectly); his vocabulary range 
allows him to discuss in simple language, using many circumlocutions, some 
current events and a few high-frequency situations and topics of his own 
or his father's profession; his speech is hesitant; he frequently gropes 
for words and may need two or three starts before completing a sentence; 
he understands fairly well normal-rate, but somewhat simplified, speech; 
however, he may require the speaker to repeat or rephrase a comment 
occasionally. 

General proficiency level: 20

Range 20.0-20.3: Candidate's accent is markedly foreign; he makes
few but consistent pronunciation errors; his grammatical errors, which
occasionally lead to misunderstandings, show that he lacks complete
control of some major grammatical patterns; his range of vocabulary is
adequate to handle confidently but not fluently inquiries and casual
conversations about family and friends, current employment, trips, and
his studies, using simple constructions and circumlocutions; his speech
is somewhat hesitant; at times he gropes for words; he comprehends normal- 
rate speech quite well, only occasionally asking for the repetition of a 
word or phrase. 

Range 20.4-20.6: Although the candidate's accent is foreign, his few 
mispronunciations are mostly random and only occasionally interfere with
understanding; his infrequent grammatical errors show imperfect control of 






several grammatical patterns, but they seldom lead to misunderstandings;
his vocabulary allows him to express himself, using simple constructions,
quite accurately and with some confidence on a number of topics, including
current events as well as his daily routine, studies, work, hobbies, and
interests; he is able to describe a person or place in some detail, can
narrate a sequence of events, and can ask in simple language for help when
he sees himself confronted with difficulties or complications in his
studies or his work; his speech is confident and only occasionally interrupted
by groping for words; his comprehension of normal, educated speech
is not perfect and requires the speaker occasionally to repeat or rephrase
his sentences more simply.

Range 20.7-20.9: Candidate's few mispronunciations are slight and
random; his accent is foreign; neither shortcoming seriously interferes
with understanding; most of his grammatical errors are also random and
seldom interfere with understanding; his vocabulary is sufficiently large
that he can express himself simply and with some circumlocutions on a few
social and professional topics, as long as they are general enough in
nature not to require specialized vocabulary; his speech is somewhat
uneven, caused by occasional rephrasings of sentences; his comprehension
of normal, educated speech is nearly perfect, and he rarely requires
sentences to be repeated or rephrased.

General proficiency level: 25

Range 25.0-25.3: Candidate's accent, although foreign, and his
mispronunciations, which are minor and random, rarely lead to misunderstandings;
random grammatical errors are frequent; consistent grammatical
errors that show imperfect control of grammatical patterns are limited
to two or three; his choice of words is sometimes inaccurate, but his
vocabulary range permits him to discuss with some difficulty general
student, professional, and social problems (e.g., financial problems,
car repair, house repair/rebuilding, health problems); his speech is
occasionally hesitant, caused by groping for the correct word; he understands
normal, educated speech and seldom needs to have statements
rephrased or restated for him.

Range 25.4-25.6: Candidate's accent is recognizably foreign; his
errors in pronunciation are frequent but of little consequence with regard
to understanding; occasional grammatical errors are random; one or two
imperfectly controlled grammatical patterns lead to consistent errors,
which, however, have little effect on understanding; his vocabulary
includes a number of professional terms that extend the range of professional
topics he is able to talk about; his speech when talking about
more specialized professional topics is hesitant and marked by frequent
groping for the correct words, but he comprehends most conversations of a
nontechnical nature and some of a specialized, professional one.






Range 25.7-25.9: Although candidate's accent can still be classified
as foreign, his rare errors in pronunciation do not interfere with communication;
his grammatical errors are few, mostly random, except for perhaps
one recurring pattern of error; his vocabulary inventory is large enough
to allow him to discuss some special, professional interests with a
colleague, although he uses simple constructions and interrupts his
speech frequently to grope for the correct word; consequently, his speech
is somewhat uneven, but he understands a native speaker of the target
language well, except for very colloquial or too technical speech.



There is no need to reinvent the wheel. The FSI interview test is in
principle the best oral proficiency test we have. Its reliability is
high. Its administration and evaluation procedures have been developed,
tested, and retested numerous times over the past two decades by government
testing teams. These factors are invaluable to those educators who
seek to find a testing instrument with which to measure accurately the 
oral proficiency of their students. 

I believe the few minor changes I have suggested in the test's 
administration procedure, and the major adaptation I propose here for its 
rating scale, meet the two basic objections frequently leveled against the 
FSI test when its use outside the government is being debated: the
excessive amount of time and money required to administer it, and the
overly broad FSI proficiency levels, which are not very meaningful
when testing the limited oral proficiency of high school and college
students.



INTERVIEW TECHNIQUES AND SCORING CRITERIA
AT THE HIGHER PROFICIENCY LEVELS 



Randall L. Jones 
Brigham Young University 






Despite its acknowledged shortcomings, the oral interview remains
the most useful and valid instrument for measuring spoken language
proficiency. It closely approximates a real language situation and
provides a wide variety of speech samples for evaluation. It is also
sensitive to the entire range of language proficiency, i.e., from 0 to 5
on the FSI scale. It is not calibrated finely enough to discriminate 
well within levels, but that, after all, is not its original purpose. 

In 1973 I spent several weeks interviewing language testers at the 
CIA and the FSI. Among other things, I asked them what they felt were 
significant problems with the oral interview technique. One of the 
most common responses was that the higher proficiency levels were very 
difficult to evaluate. (The higher levels are to be understood here as 3+
and above.) There is little problem for a trained tester to discriminate
between a 1+ and a 2, but there is less certainty when it gets into the
area from 3+ to 5. It generally takes longer to administer an oral
interview to an examinee whose proficiency is at a high level, but the
problem is really more than a function of time.

I would like to suggest four principal reasons for the difficulty in 
evaluating oral proficiency at the higher levels. (1) The definitions for 
levels 4 and 5 are not specific enough to provide a basis for making a 
valid judgment. (2) The standard list of performance factors (grammar,
vocabulary, fluency, pronunciation, and comprehension) is not sufficient
to distinguish proficiency at the higher levels. (3) The nature of the
oral interview is such that it does not provide an efficient method of
eliciting language performance at the higher levels. (4) Because the
number of examinees at the higher levels is relatively small, testers
do not have the opportunity to develop a feeling for the important
distinctions between and among these levels.

The matter of the proficiency definitions, I feel, is important, 
and the government language community should consider the possibility 
of making revisions. Levels 1, 2, and 3 correspond to natural stages
of proficiency development, and the definitions capture these stages
quite well. Level 1, for example, is often referred to as the "survival"
level; i.e., the speaker can communicate in the language sufficiently
well to take care of his important needs. But he has difficulty holding
up his end of a conversation for very long, and his control of grammar
and breadth of vocabulary are weak. Level 2 is often referred to as
the "courtesy" level; i.e., the speaker is able to engage in sustained
conversation without a great deal of effort, even though he may make
numerous errors and may not be able to express himself precisely in
many areas. He is confined more to what, when, who, and where, having
difficulty with how and why. The level 3 speaker has, in a sense,
"arrived." He has confidence in using the language, and he understands






his own strengths and limitations. His ability for expression is very 
good in his own area of interest and fair to good in other general areas.

The definition for level 4, however, does not provide much help in
making a satisfactory distinction between levels 3 and 4. The level 4
definition does introduce two new tasks: ability to "respond appropriately
even in unfamiliar situations" and to "handle informal interpreting
from and into the language." But these descriptions are very vague and
nothing is said about what the unfamiliar situations or interpreting task
might be. One sentence in the definition for level 4 is especially
troubling. It states that the level 4 speaker "would rarely be taken for
a native speaker." My experience with German is that nonnatives are often
told that they "speak just like a native German." Even a level 1 speaker
can pass for a native if his pronunciation is good and he keeps his
sentences restricted to those he can say without errors.

The definition for level 5 seems at first to be somewhat more 
satisfying in that it is the highest mark on the scale, the ultimate. The 
speaker's proficiency must be equivalent to that of an educated native
speaker. The obvious question here is, how does an educated native
speaker speak? What exactly is the absolute criterion against which we
are judging all our examinees? Do we really have a good intuitive
feeling about it?

The second reason mentioned above concerns the list of performance 
factors. There is no question that a level 4 speaker has better control 
over structure, vocabulary, etc., than the level 3 speaker, but I feel 
tnere is an additional factor that becomes important at this point: the 
sociolinguistic factor. I do not mean sociolinguistics in the broad 
sense, but rather those aspects of language that have more to do with 
social interaction than with imparting information. Common examples 
include expressing gratitude, responding to an expression of gratitude, 
excusing oneself, responding to such an excuse, expressing greetings 
and farewells, paying a compliment, receiving a compliment, declining an 
invitation, expressing surprise or annoyance or anger, complaining, and so 
on. Social communication also includes the use of hesitation words and 
other noncommunicative words and phrases. In many cases it does not 
concern what is said so much as when and how it is said. For example, in 
our own culture the proper response to a compliment is usually "thank 
you," but in many cultures that would be considered impolite. If we 
sneeze it is expected of us to say "excuse me," but in some cultures 
nothing is said, because it is not considered polite to draw attention to 
the sneeze. The beginner does learn standard phrases for expressing 
gratitude, excusing himself, or whatever, but the presumed standard 
phrases often found in the textbooks are in many cases seldom used by real 
native speakers. I suggest that sociolinguistic sensitivity be added to 
the list of performance factors, and that it be incorporated into the 
definitions for levels 4 and 5. 

The oral interview is really not an interview in the strict sense of 
the word, but rather a conversation between two or more people. It is 
also a test in that one of the partners in the conversation is providing 
stimuli and the other one is giving responses. But there is a lot of room 
for variation, and the examinee can often avoid problem areas by talking 
around them. How can the examinee's "high degree of fluency and precision 
of vocabulary" really be demonstrated? The fact is that the interview 
technique is not notably efficient for eliciting specific speech samples 
beyond the 3 level. It requires a lot of time to obtain very little data. 
Other nonconversational techniques are thus necessary to get at the 
important aspects of proficiency at the higher levels. It is true that 
such techniques tend to be artificial and somewhat removed from real 
language situations, but they can nevertheless be valid indicators of 
language proficiency. 

The fourth problem mentioned above relates to the fact that most 
testers are so rarely exposed to examinees above the 3 level that they do 
not develop a feeling for how 4 and 5 level speakers should perform. 
This also raises an interesting question: Is there really a need to test 
beyond the 3 level? I have heard the suggestion made that anyone who is 
obviously above the 3 level should be put into the category 3/5, that is, 
somewhere between 3 and 5. I do not believe there are any 
language-essential positions in the government designated at the 5 level, 
and probably very few at the 4 level. It seems that knowing a candidate is 
beyond 3 would be sufficient. This is, of course, a managerial and not 
a linguistic issue, but it seems that if there are five levels of 
proficiency, we have an obligation to develop suitable techniques for 
testing at each level. With regard to the training of testers, after the 
criteria for performance at the higher levels have been more clearly defined, 
samples of 3+, 4, 4+, and 5 level speakers can be recorded and annotated 
for training purposes. 

I feel that at the present time the range of proficiency levels from 
3+ to 5 is not properly understood. There is, however, good evidence 
that there are criteria that can distinguish among the specific levels 
within this large realm. In an attempt to get closer to the problem, 
I considered several methods of eliciting language performance from 
examinees that would be useful in evaluating the higher levels. The 
procedures are not new with me, and in some cases they have already been 
tried by oral interview testers. I ultimately decided on four techniques 
that I wanted to experiment with: (1) a picture-vocabulary task, (2) an 
anecdote retelling task, (3) a repetition task, and (4) a situation task. 
The language I chose for the experiment was German. Because the language 
performance of an educated native speaker is the ultimate criterion of 
judgment, I had five educated native speakers of German participate in 
the experiment, along with ten educated nonnative speakers. The four 
techniques are described briefly below, followed by a discussion of the 
results of the experiment. 

Vocabulary is one of the five specified factors for evaluating 
performance in an oral interview, and there is no question that the 
breadth and precision of vocabulary increases as the language learner 
approaches the level of the native speaker. But it is often difficult to 
judge from an oral interview what words the examinee does and does not 
know. For this experiment I decided to select words that are quite low in 
frequency but broad in their range of occurrence, i.e., objects that are 
very much a part of everyday life but not often talked about. These 
are words that native speakers are certain to know but that nonnative 
speakers would be less likely to have learned. The stimuli were pictures 
from German magazines. (The objects are listed in Appendix A.) Subjects 
were shown the pictures one by one and asked to identify the specific 
objects by name. They were asked to say so if they did not know the word 
for a particular object. 

For the retelling task, each subject read five short anecdotes in 
German and retold each one in his own words immediately after it was 
read. (See Appendix B.) He was allowed as much time as he wished to read 
each anecdote, but he was not allowed to refer to the printed version 
after he began to retell it. The anecdotes were quite short, so memory 
was not really an important factor. 

For the repetition task, every subject listened to five recorded 
German sentences. (See Appendix C.) As each sentence was played the 
subject listened and then attempted to repeat it verbatim. The sentences 
ranged in length from three to five seconds, from ten to nineteen words, 
and from twenty to twenty-nine syllables. The idea for the task comes 
from a study done a few years ago by Merrill Swain and others at the 
Ontario Institute for Studies in Education. Swain rejects the notion that 
repetition or imitation is merely a perceptual-motor skill. She claims 
that if the utterance to be repeated is long enough (she used French 
sentences of about fifteen syllables), it has to be decoded, stored, 
recalled, and encoded. This task is, of course, impossible unless the 
subject has some degree of proficiency in the language. The higher the 
proficiency, the better the ability to process the sentence and repeat 
it. The hearer must somehow match the incoming signal against existing 
words and structures in the language that he has stored in his memory. If 
the words and structures are not there, the sentence--or at least part of 
it--will evaporate and he will not be able to repeat it successfully. 

The fourth task was the elicitation of expressions in various sit- 
uations in an attempt to get at some of the sociolinguistic elements of 
language proficiency. Each subject was given ten cards on which specific 
situations were described. (See Appendix D.) He was asked to read each 
card and say how he would respond in the situation. 

Of the five native speakers who served as subjects, two were 
undergraduate students at Cornell, two were graduate students, and one was 
the wife of a graduate student. All the nonnative subjects speak English 
as a first language. One of them was an undergraduate student; the others 
were graduate students. All have lived in Germany for extensive periods, 
and it has been said of six of them (by people who are in a position to 
judge) that they "speak just like natives." Whatever the case, all of 
them would be rated 3+ or higher. 

The picture-vocabulary test was administered first. It performed 
very well in distinguishing between the native and nonnative speakers, but 
it did not discriminate well among the nonnative speakers. Among the 
native speakers, three of the ten objects were identified using the 
same words, five were identified using various synonyms, and two were 
problematic because of the pictures. Among the nonnative speakers, no 
object was identified with the same word by all subjects, and for no 
object did all the subjects use an acceptable word. The number of objects 
correctly identified by the nonnative speakers ranged from zero to 
three. 

The effectiveness of the picture-vocabulary task can be demonstrated 
by three of the objects: a ball of yarn, a calf (of a leg), and an 
earlobe. These objects, by the way, were the three that all the native 
speakers identified with the same word. None of the nonnative speakers 
knew the word for "ball of yarn," although several of them said "yarn." 
One knew the word for calf, and five knew the word for earlobe. There are 
numerous objects that can be used for this task, i.e., objects that are a 
common part of the culture but that nonnative speakers learn very late 
in their acquisition of the language. I feel it is a good supplement 
to the oral interview for testing at the higher levels. It also seems 
possible to assign difficulty factors to the various objects for a specif- 
ic language, thus assisting in making finer discriminations within the 
higher proficiency range. 
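
One way to picture such difficulty factors is the short Python sketch 
below. It is only an illustration and was not part of the experiment. It 
takes the three counts reported above (of the ten nonnative subjects, none 
knew "ball of yarn," one knew "calf," and five knew "earlobe") and assumes, 
for the sake of the example, that difficulty is the proportion of nonnative 
subjects who did not produce the word. The function names and the formula 
are hypothetical. 

```python
# Hypothetical sketch: a difficulty factor for picture-vocabulary items.
# Assumption (not from the study): difficulty is the share of nonnative
# subjects who did NOT know the word for the pictured object.

NONNATIVE_SUBJECTS = 10  # ten nonnative speakers took part

# Number of nonnative subjects who knew each word (counts given in the text).
known_by = {
    "ball of yarn": 0,
    "calf (of a leg)": 1,
    "earlobe": 5,
}


def difficulty(known: int, total: int = NONNATIVE_SUBJECTS) -> float:
    """Return the proportion of subjects who failed to name the object."""
    if not 0 <= known <= total:
        raise ValueError("known must be between 0 and total")
    return (total - known) / total


def rank_items(counts: dict[str, int]) -> list[tuple[str, float]]:
    """Order items from hardest to easiest by their difficulty factor."""
    scored = [(item, difficulty(known)) for item, known in counts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for item, factor in rank_items(known_by):
        print(f"{item}: {factor:.1f}")
```

With a larger bank of pictures, the same ordering could suggest which items 
separate speakers within the upper range, although any real scale would 
need far more subjects than the ten reported here. 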

The retelling task not only discriminated well between the native and 
nonnative groups, but it also distinguished among the members of the 
nonnative group quite well. In all cases the native speakers retold the 
anecdotes with all the essential facts and using all key vocabulary. The 
performance among the nonnative group was spread across a broad range. In 
a couple of cases, the point of the story was completely missed. 

There were a couple of rather unexpected side benefits that made this 
task even more interesting. First, the native speakers tended to use a 
lot of little filler and transition words and phrases that were not in the 
original story; the nonnative speakers did not do this. Second, in many 
cases the nonnative speakers used vocabulary from the original story, but 
incorrectly, e.g., used the wrong gender or an incorrect past tense form. 
And, finally, it was obvious that some nonnative speakers simply did not 
understand the meaning of some of the words. This affected the retelling 
of the story considerably. The retelling task was the most time-consuming 
of the four, but it was quite productive. I did not take the time to 
analyze each speaker carefully, but I am certain that the performance of 
the nonnative speakers could easily be rank-ordered according to specific 
observable criteria. 

The repetition task was quick and very effective. All the native 
speakers performed well on this task, having little difficulty repeating 
the sentences without errors. The performance of the nonnative speakers, 
on the other hand, was once again spread across a wide spectrum. None 
of them performed as well as any of the native speakers, but one came 
very close. Problems related directly to the length of the sentence 
and the vocabulary in it. The less proficient nonnative speakers had 
difficulty completing some of the longer sentences and tended to omit 
unfamiliar words and phrases. Also, similar words in sentences caused 
some confusion. One sentence, for example, has the words Ausserdem 
and aussergewöhnlich. The similarity of the two tended to create some 
confusion. Again, I did not make a careful analysis of each performance, 
but I feel this task is an excellent technique for testing proficiency at 
the higher levels. 

The situation task was, without question, the most disappointing, 
although I am not yet ready to give it up. Whereas the native speakers 
retold the anecdotes with enthusiasm, they responded to the situations 
rather unnaturally. In most cases, they had to think about them for a 
while. Two of the situations proved to be very unproductive: the "pretty 
shirt" and "being startled." Native and nonnative speakers alike seemed 
to be puzzled for answers. Some interesting observations were made during 
this task, although I am not certain how useful they would be for testing. 
When asking directions of the man on the street, most of the nonnative 
speakers began by saying "excuse me" (or the German equivalent), but none 
of the native speakers did. When responding to the salesman at the door, 
the native speakers merely said, "No, I'm too busy" or "I never buy 
anything at the door." Several of the nonnative speakers gave elaborate 
explanations. Although the task was less than successful in getting at 
the social communication I was looking for, I feel it can be developed 
into a useful technique, and further work should be done. Much depends 
on what the situation is and how it is described. 

I feel these four techniques can be valuable in assisting the tester 
to make judgments at the higher proficiency levels. More research needs 
to be done to refine the techniques and to specify the criteria more 
closely. A bank of pictures, anecdotes, sentences for repetition, and 
situations can be built up, with each one tested and assigned a difficulty 
factor. It is hoped that the vague proficiency area between 3+ and 5 will 
thus be better understood and become easier to evaluate. 



Appendix A 
List of Vocabulary Items 



(1) bottle cap (screw type), (2) calf (of a leg), (3) dog's nose, 
(4) dumbbell, (5) hubcap, (6) earlobe, (7) weather vane, (8) ball of 
yarn, (9) gasoline pump, (10) place mat. 



Appendix B 
Anecdotes 

1. Moses Suppengrün in Krotoschin verdiente mit seinem Getreidehandel 
so viel, dass er seinen Sohn studieren lassen konnte. Zum erstenmal kam 
der junge Moritz von der Berliner Universität auf Ferien nach Krotoschin 
und sein Vater fragte ihn, was er nun eigentlich studiere. 

"Philosophie", antwortete der Sohn. 

"Wie heisst? Was ist Philosophie?" 

"Will ich dir zeigen, was ist Philosophie.--Also de glaubst, de bist 
in Krotoschin, nicht wahr?" 

"Ja, ich glaub', ich bin in Krotoschin", gab der Vater zu. 

"Pass auf, werd' ich dir mit meiner Philosophie beweisen, dass de 
nischt bist in Krotoschin!" 

"Nanu!" 

"Also, wenn de bist in Krotoschin, dann bist de nischt in Posen?" 
"Nein, dann bin ich nicht in Posen." 

"Wenn de bist nischt in Posen, dann bist de doch anderswo?" 
"Is richtig!" 

"Nu, wenn de bist anderswo, dann bist de doch nischt in Krotoschin?" 

"Is wirklich richtig", murmelt der Vater und verfällt in tiefes 
Nachdenken. Auf einmal gab er seinem Sohn eine gewaltige Ohrfeige. 

"Was ist?" rief dieser. "Warum schlagst de mir?" 

"Ich?" sagte der Vater und machte ein ebenso erstauntes Gesicht. "Ich 
hab' dir nischt geschlagen! Wie kann ich dir schlagen, wenn de bist in 
Krotoschin und ich bin anderswo?" 

2. Tünnes und Schäl sind gestorben. Der eine kommt in den 
Himmel, der andere in die Hölle. Eines Tages haben beide Urlaub, und 
sie treffen sich auf einer Wolke. 

Der Schäl, der aus der Hölle kommt, erzählt: "Ach, wir arbeiten 
am Tage zwei Stunden, und das Quartier ist anständig und das Essen ist 
auch ziemlich gut." 

Der Tünnes erzählt aus dem Himmel: "Wir müssen jeden Tag 
zwölf Stunden arbeiten!" 

"Wie?" sagte der Schäl. "Wie kommt das denn?" 

Tünnes: "Ja, wir haben eben zu wenig Leute!" 

3. Es war kurz vor Weihnachten, als ein armer Bauernjunge an einem 
Fenster des Bürgermeisters eine fette Gans hängen sah. Er dachte: 
Mein liebes Gänschen, du hängst dort oben so einsam, ich will dich in 
eine gute Familie bringen. 

Am Abend ging er heimlich mit einer Leiter zum Hause des 
Bürgermeisters. Langsam stieg er zum Fenster hinauf, an dem die Gans 
hing. Er hatte den fetten Vogel schon in der Hand, als er plötzlich die 
laute Stimme eines Polizisten hörte: "Halt! Was machst du dort oben?" 
Ohne die Nerven zu verlieren, antwortete der Junge: "Da bald Weihnachten 
ist, will ich dem Herrn Bürgermeister als kleine Überraschung eine 
fette Gans an das Fenster hängen." Der Polizist rief ärgerlich: 
"Unsinn, komm sofort herunter!" "Nun", meinte der Junge, "das ist 
wirklich schade, denn jetzt muss ich die Gans wieder nach Hause mitnehmen." 

4. Ein junger Amerikaner, der wie viele in diesen Tagen im Sommer 
nach Europa gefahren ist, kommt auf seiner Reise auch nach Italien. In 
Rom kommt er in einem kleinen Restaurant beim Essen mit einem Italiener 
ins Gespräch. Man erzählte sich von den beiden Ländern, ihren 
Menschen und ihren Eigentümlichkeiten. Der Amerikaner will seinem 
Freund erklären, wie gross sein Land ist im Vergleich zu Italien oder 
anderen Ländern. 

"Bei uns setzt man sich in einen Zug, und dann fährt man eine 
Stunde, mehrere Stunden, sogar einige Tage, und dann ist man immer noch 
in Amerika." 

Da antwortet der Italiener unbeeindruckt: "Das kennen wir! Solche 
Züge haben wir bei uns auch." 

5. Eine reizende junge Dame tritt in ein Seidenwarengeschäft. 
Der tadellos frisierte und geschniegelte Verkäufer überschüttet sie 
mit einer Flut von liebenswürdigen Redensarten, und da die junge Dame 
keineswegs prüde zu sein scheint, wird er immer verliebter. 

"Was kostet dieses seidene Band?" fragte die hübsche Kundin. 

"Einen Kuss der Meter!" antwortet schmachtend der junge Mann. 

"Schön, packen Sie mir zehn Meter ein!" 

Als dies geschehen war, sagt die junge Dame lächelnd: "Warten Sie, 
draussen vor dem Schaufenster steht meine Grossmama, die bezahlt für 
mich." 



Appendix C 
Texts of Repetition Sentences 



1. Ausserdem werden in diesem Jahr aussergewöhnlich viele 
Studienräte in den Ruhestand treten. 

2. Proteste gegen Kernkraftwerke hat es in den letzten Monaten 
in Hülle und Fülle gegeben. 

3. Aber es geht mir heute Abend gar nicht um die Frage, ob die 
Stuttgarter Entscheidung richtig war oder nicht. 

4. Die Sowjetunion hat viele Millionen Tonnen Getreide in den 
Vereinigten Staaten gekauft. 

5. Gleichzeitig hat diese Meldung jedoch für die Schulen eine 
Schattenseite. 



Appendix D 
Situations* 



1. You are looking for the tourist office in an unfamiliar city. 
You go to someone who is standing on the street to ask directions. 
You say ... 

2. You are a guest for dinner at someone's house. You have almost 
finished eating, and the hostess offers you more food. You would like 
some, and you say ... 

3. You are in a department store and you accidentally step on 
someone's foot. You say . . . 

4. You are wearing a new shirt (blouse). Someone sees it and says, 
"That's really beautiful." You say . . . 

5. You have been speaking with a friend for about fifteen minutes. 
You have an appointment now and must go. You say ... 

6. You are speaking with a friend. He (she) says something very 
startling about someone else. You say ... 

7. You are sitting quietly at a desk reading a book. Someone 
walks up and says something to you. You are startled because you did not 
hear him coming. You say ... 

8. You are invited to a party but you really do not want to go. 
You say (lie) ... 

9. You have been waiting for a friend for thirty minutes. Finally 
he (she) comes. You say ... 

10. The doorbell rings. You go to the door and find a salesman. 
He introduces himself and asks, "May I come in for a few minutes?" You 
say ... 



*For the experiment, the sentences were in German. 



References 

Jones, Randall L. "The FSI Oral Interview." In Advances in Language 
Testing, edited by Bernard Spolsky. Arlington, Va.: Center for 
Applied Linguistics, in press. 

Swain, Merrill, G. Dumas, and N. Naiman. "Alternatives to Spontaneous 
Speech." EDRS: ED 123 872. 

Valette, Rebecca. Modern Language Testing, rev. ed. New York: Harcourt, 
Brace, Jovanovich, 1977. 



TESTING SPEAKING PROFICIENCY THROUGH 
FUNCTIONAL DIALOGUES 



I. F. Roos-Wijgh 
Dutch National Institute for 
Educational Measurement 



In this presentation I will deal with the following: (1) the 
teaching of modern foreign languages in the Netherlands; (2) the function 
of CITO (Dutch National Institute for Educational Measurement), the 
institute where I work; (3) recent developments in the tuition of speaking 
proficiency; (4) the purpose of the CITO speaking proficiency tests and 
a description of the area of language behavior covered by these tests; (5) 
the form and function of the tests; and (6) expectations for the future. 

Language Teaching in the Netherlands 

Modern foreign languages play an important part in secondary 
education in the Netherlands. The reason for this is that our language 
is spoken by very few people in comparison with, for instance, the English 
language. Moreover, there are numerous contacts with the surrounding 
countries, in both the economic and the touristic spheres. 

To give an impression of the smallness of the area of this part of 
Western Europe, the distance from Paris to Amsterdam is about the same as 
that from Boston to Washington. And in Paris they speak French, as you 
all know. Most of the Dutch population lives less than one hour's drive 
from Germany, where they speak German. 

Consequently, there is in the Netherlands a great need for being 
proficient in at least one foreign language, and this is reflected in the 
curriculum of the secondary schools, in which about 30 percent (and often 
more) of the total time available is devoted to modern foreign languages. 
In the first three years of secondary education, English, German, and 
French are obligatory subjects; later on it is possible to drop one or 
two. You can also choose Spanish or Russian. Since the sixties the 
emphasis in language teaching has been more and more on the communicative 
aspect of language. One of the consequences is that now more attention is 
paid to speaking. 

In sociolinguistics, methods were developed for describing this 
communicative aspect of the language and these methods are the base of 
modern curriculum development of foreign language education. 

CITO 

The developments in foreign language teaching are reflected in 
the activities of the language department of CITO. This institute was 
established in 1968 by the Dutch government, with the object of promoting 
the development of objective tests for the educational field. 

At first the language department occupied itself with the production 
of reading comprehension tests; later we also made listening comprehension 
tests (both for use as final examinations). Recently we developed 
criterion-referenced tests for the first years of foreign language 
teaching and have started a project to develop speaking proficiency 
tests. As this is a fairly new project, it is not yet possible to provide 
detailed information on the tests and their outcomes. But I will try to 
explain to you the underlying concepts and how the contents of the tests 
are determined. 

Recent Developments 

First of all I'll give you some more background information about 
recent developments in the tuition of speaking proficiency. In the 
present situation it is usually the teachers that decide how they 
will test speaking proficiency. This means in practice that they ask 
students to tell something about the literary works they have read, 
or put questions to them with reference to a text. In other words, the 
students are simply asked to "say something about something." The CITO 
project, "Testing of Speaking Proficiency," does not conform to this 
situation, but is based on new trends in the field of systems development 
in language learning. 

Under the auspices of the Council of Europe, experts have defined a 
so-called threshold level. This level "may be seen as the lowest level of 
effective language use, thus defining a threshold at which language 
learning establishes general communicative ability minimally adequate to 
the general range of language situations in a speech community and which 
is thus an appropriate objective for initial language courses" (Council 
for Cultural Cooperation of the Council of Europe, 1973). It is 
essentially a level of oral communicative ability, designed for adult 
learners. 

The model for the definition of language-learning objectives 
specifies eight components, but I'll mention only the most important 
ones for our tests. They are (1) the situations in which the foreign 
language will be used, including the topics that will be dealt with; (2) 
the language functions (or speech acts) the learner will fulfill (e.g., 
giving information, asking for information); and (3) the specific 
(topic-related) notions the learner will be able to handle. As noted above, 
this threshold level was essentially developed for adult learners. 
But now the author has published a special version of this model 
for foreign language teaching in schools (van Ek, 1975). Some of the 
suggestions of this adapted version have already been realized in a number 
of schools. There are schools that have special one-week projects on, 
for instance, shopping. The first thing required of the students is no 
longer to say something about something but to say something in a given 
situation. 

We are now working on tests that can serve as a sequel to this 
development. What we want to test is the ability to perform various 
speech acts in a foreign language in the form of a dialogue, with the 
student both taking the conversational initiative and responding. By 
"dialogue" is meant here the whole of the dialogues that take place in 
communicatively relevant situations in which one is confronted with 
persons using that language. 

A description has been made of those situations that can be 
considered communicatively relevant when one is abroad or comes into 
contact with foreigners in one's own country. The language behavior in 
such a situation is specified by the situation itself and any parts 
thereof, the roles played by the speakers in the situation, the speech 
acts that have to be performed, and the specific informational aspects 
connected with the speech act in that situation. 



Example 

When you describe the situation:          camping 
a part of that situation may be:          reception desk 
the roles played are those of:            receptionist/guest 
speech acts to be performed (by guest):   asking for information 
                                          giving information 
                                          persuading 
                                          refusing 
                                          yielding 
                                          expressing wishes 
                                          expressing (dis)satisfaction 
the specific informational aspects to 
be dealt with:                            site for the tent 
                                          number of persons 
                                          equipment 
                                          quietness (at night) 
                                          facilities 
                                          time of arrival/departure 
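
For readers who wish to keep such a specification in machine-readable 
form, the camping example might be recorded roughly as in the Python 
sketch below. This is an illustration under stated assumptions: the class 
name SituationSpec and its field names are hypothetical and are not part 
of the CITO materials; only the values are taken from the example above. 

```python
from dataclasses import dataclass, field


@dataclass
class SituationSpec:
    """One communicatively relevant situation (hypothetical record layout)."""

    situation: str                 # e.g. "camping"
    part: str                      # the part of the situation being tested
    roles: tuple[str, str]         # the two speakers in the dialogue
    speech_acts: list[str] = field(default_factory=list)
    notions: list[str] = field(default_factory=list)  # informational aspects


# Values copied from the example in the text.
camping = SituationSpec(
    situation="camping",
    part="reception desk",
    roles=("receptionist", "guest"),
    speech_acts=[
        "asking for information",
        "giving information",
        "persuading",
        "refusing",
        "yielding",
        "expressing wishes",
        "expressing (dis)satisfaction",
    ],
    notions=[
        "site for the tent",
        "number of persons",
        "equipment",
        "quietness (at night)",
        "facilities",
        "time of arrival/departure",
    ],
)

if __name__ == "__main__":
    print(camping.situation, "/", camping.part)
    for act in camping.speech_acts:
        print(" -", act)
```

Keeping the four parts of the description as separate fields mirrors the 
way the specification is laid out in the example, so that each part can be 
listed or compared on its own. 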



In this way we specified some fifteen situations, such as public 
transport, shopping, police station, entertainment, and camping. Of 
course there are numerous other situations; we only picked those that 
could be relevant to the majority of the learners. In these situations 
the theme of the conversation is intrinsically quite stereotyped. At 
a railway station you never ask, "What is the color of a return ticket 
today?" 

There are also communicatively relevant topics that are not limited 
to particular situations, and the language behavior in these cases will 
be far less predictable. One can, for example, tell something about 
one's hobbies at the edge of a swimming pool, or at a party, or in the 
compartment of a train, etc. That is why a thematic specification of 
the language behavior required has been included in the description of the 
area of language behavior that will be covered by the tests. The theme 
has been further specified as follows: (1) the theme and its subthemes 
and (2) the speech acts to be performed with regard to the theme. 

Example 

When you pick out the theme:      personal data 
subthemes are:                    name, address, age, origin 
and the speech acts to be 
performed can be:                 identifying 
                                  qualifying 

We listed the following themes: 

everyday life             spending one's leisure time 
holidays                  home 
family/relatives          hometown 
personal education        information on one's own 
ambitions                   country and people 
interests                 current social and political 
                            problems 

The author of the threshold level concept does not make this 
difference between situations and themes; he just presents a list of 
topics. We, however, consider this distinction useful when you work out 
the system in more detail. "Railway station" can be a theme; you can talk 
about trains and railway stations anywhere. But when you consider it as a 
situation, i.e., when you take into account the setting of a railway 
station, you perform another kind of language behavior. 

It is quite obvious that these descriptions are not exhaustive. 
Teachers will be consulted to find out what relevant themes and situations 
are still lacking. Moreover, they will have to indicate the priorities 
within the area of language behavior. Besides the speech acts that 
are linked to themes and situations, we also listed a separate group of 
so-called social speech acts, which serve to start or end a conversation 
and to show courtesy, such as greeting, introducing oneself, inviting, 
thanking, taking one's leave, congratulating, and expressing best wishes. 

Thus, the language behavior that is required by the tests can be 
classified according to situational specification, thematical specifi- 
cation, and social specification. 



Form and Function 

As the tests are based on a method of specifying language learning 
objectives that has only started to make its way into the schools, it 
would be premature to offer them as selective final tests now. We are 
developing them chiefly to support the learning process in schools. 
They will, therefore, be introduced in cooperation with other institutes 
rendering services to the field of education, such as the National 
Institute for Curriculum Development and regional school advisory centers. 
The test will be published in the form of a set of thematical and 
situational tests. An index will make it possible to choose several 
entries to the tests. 

Use of the tests can best be illustrated with the help of a practical 
example. Suppose a teacher of French wants the students to be able to 
communicate their accommodation needs to the receptionist at a camping 
site. The teacher thus chooses from the index "situations" the test 
"camping." The test begins with a short introduction so the student 
knows what role he or she has to play. 

The first tasks set in this test are: 

1. Le soir, vous arrivez à la réception du camping. Il y a 
une vieille dame. Saluez la dame! 

2. La dame dit "Bonsoir." Puis vous demandez une place à la dame. 
You can answer: Je veux/voudrais camper ici 
                - une place (pour ma tente) 
                - passer la nuit ici/au camping 

specification: role: guest/receptionist 
               speech act: asking for information 
               notion: site for tent 

3. La dame vous dit: Il n'y a plus de place. Vous insistez, vous 
faites savoir que votre tente n'est que très petite. 

                - (Mais Madame) (je vous en prie) ma tente est 
                  très petite. 
                - même pas une toute petite place? 
                - (Vous êtes sûre) même pour une 
                  toute petite tente? 

specification: role: guest/receptionist 
               speech act: persuading 
               notion: site for tent 

And so on. (The test comprises ten tasks.) In the second part 
of the test you make the acquaintance of your neighbor at the camping 
site. The dialogue that follows can be characterized as a thematical 
dialogue; you are asked to talk about your country and your hometown. The 
conversation runs as follows: 



Le voisin dit: Vous n'êtes pas français, n'est-ce pas? 
The student answers: Non, je suis Hollandais. 

Le voisin: Ah, la Hollande. La capitale de Copenhague est magnifique! 

The student answers: Copenhague n'est pas la capitale de la 
Hollande. C'est Amsterdam. 

When they were pretested, these items proved to work very well.
The students got so involved that they simply forgot they were in a
test situation and answered very spontaneously, even indignantly in the
"persuading" role.

The test can be compared to a story; the tester is both narrator and
actor. As the situation is a stereotype, the responses required are
highly predictable. The teacher sets the tasks and has several pupils
give the answers. He can note down the results in some way or other. The
students can also practice among themselves and write down which tasks
they were not able to perform. When a number of tests have been dealt
with, it may turn out, for example, that most of the students cannot
satisfactorily exchange greetings. The teacher then consults the "social"
index and finds out in which of the other tests "greeting" is also
included.

If in the "camping" test the students have shown they are good at
asking for information, the teacher can check the index for other tests in
which "asking for information" also occurs. He can then check whether the
students are also able to ask for information in other situations and with
reference to other specific notions. 
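
The index lookup described above can be pictured as a small cross-reference table. The following Python sketch is only an illustration under stated assumptions: the catalogue contents, the test names other than "camping," the treatment of "greeting" as a speech act, and the function names are all hypothetical and are not taken from the CITO materials.

```python
from collections import defaultdict

# Hypothetical catalogue. Only the "camping" labels for asking for information
# and persuading come from the example above; every other entry is invented
# for illustration, including filing "greeting" under "speech_act".
TESTS = {
    "camping": [
        {"speech_act": "greeting", "notion": None},
        {"speech_act": "asking for information", "notion": "site for tent"},
        {"speech_act": "persuading", "notion": "site for tent"},
    ],
    "railway station": [
        {"speech_act": "greeting", "notion": None},
        {"speech_act": "asking for information", "notion": "departure time"},
    ],
    "restaurant": [
        {"speech_act": "asking for information", "notion": "price"},
    ],
}


def build_index(tests, key):
    """Map each label found under `key` to the set of tests that contain it."""
    index = defaultdict(set)
    for test_name, tasks in tests.items():
        for task in tasks:
            label = task.get(key)
            if label is not None:
                index[label].add(test_name)
    return index


def other_tests_with(index, label, exclude):
    """Return, alphabetically, the tests containing `label` other than `exclude`."""
    return sorted(index.get(label, set()) - {exclude})


speech_act_index = build_index(TESTS, "speech_act")
print(other_tests_with(speech_act_index, "asking for information", exclude="camping"))
print(other_tests_with(speech_act_index, "greeting", exclude="camping"))
```

Building one index per label type (situation, speech act, notion) is what lets the same set of tests be entered from several directions.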

After sufficient practice, the teacher can go through the whole
"camping" test with a small separate group of students and give them
marks according to two aspects: Was the student successful in getting the
message across? Is what has been said formally correct? At this stage
the teacher's purpose is to trace shortcomings, so a "soft" form of
testing is sufficient, aimed at acquiring feedback for both teacher
and student. Our team is presently working on the development of an
elaborate rating scale. This presents us with enormous problems, such as
determining what specific criteria have to be taken into account in
judging communicative ability.



Expectations for the Future 

The development at CITO of speaking proficiency tests is still in
its initial phase, but already teachers have shown interest in this kind
of testing. A few tests have been pretested on a limited scale and
experience confirms that the tests meet a long-felt need.






In a year's time the set of tests will be published and, after
they have been in use for one or two years, CITO will develop a final
test that will be representative of the language behavior as it is
described in the area of required language behavior. We hope that the use
of these tests will contribute to new developments in foreign language
teaching. Speaking the language in class will not be artificial but more
practical and true to life. The students will then be motivated, because
they will find that they can really "do something" with the language.
What they have learned at school will enable them to make contact with
foreigners and to communicate with them, both in their own country
and while traveling abroad. They will be able in everyday life to say
relevant things instead of, for example, giving hardly intelligible
expositions on the works of Sartre.






References 

Coste, D.; Courtillon, G.; Ferenczi, V.; Martins-Baltar, M.; and Papo, E.
Systèmes d'apprentissage des langues vivantes par les adultes: Un
niveau-seuil. Strasbourg: Conseil de la coopération culturelle du
Conseil de l'Europe, 1976.

Council for Cultural Cooperation of the Council of Europe. Systems
development in adult language learning. A European unit/credit
system for modern language learning by adults. Strasbourg, 1973.

------. The Threshold Level. A European unit/credit system for
modern language learning by adults. Strasbourg: Council for
Cultural Cooperation of the Council of Europe, 1975.






SCOPE AND LIMITATIONS OF INTERVIEW-BASED LANGUAGE TESTING:
ARE WE ASKING TOO MUCH OF THE INTERVIEW?



Robert Lado 
Georgetown University 




Introduction 

The Physician^s Interview and Examination 

What happens when one goes to the doctor for a serious examination?
The doctor begins by interviewing the patient: "How do you feel? What
seems to be the problem? How long has that been bothering you? Have
you had those symptoms before? When does it hurt? How is your appetite?
Are you able to sleep at night? What is your normal weight? Have you
been losing weight lately?" And so forth.

Your attitude is one of cooperation with the physician; that is, you 
do not try to mislead the doctor or hide your symptoms. Yet, as a rule,
the doctor does not make a serious, final diagnosis directly from the 
interview and first-hand observation of your appearance and behavior. 
Questions are raised in the doctor's mind. Mental notes are made as the 
interview proceeds. Hypotheses develop and are often discarded to make 
way for other possibilities. 

Depending on the observations made during the interview, the doctor
proceeds with a number of specific tests. The doctor or a trained nurse
takes your exact weight instead of accepting your report or making an
estimate from your height and the look of your waistline. Stunt men at
carnivals can make remarkably accurate guesses of your weight by simply
looking at you, and they bet they can guess within five pounds of it or
you win a prize. Yet your physician asks you to step on the scale and
measures your weight to within a pound or less. The carnival estimator
bets he can come within a ten-pound range, and he does not always win.
The physician would not even consider recording a sharp-eye estimate.

In addition, the doctor may listen to your heart, check your pulse,
or listen through a stethoscope as he taps your chest. He or she does
not just hold a tight grip on your arm to estimate your blood pressure as
circulation begins to pulse through. A sphygmomanometer measures that
pressure so a reading can be made from the height of a mercury column
against a scale, or from a needle pointing to a circular scale. And
notice that to take your pulse rate the physician or the nurse looks at a
watch as a count of the pulsations is made. It is easy to train yourself
to count seconds quite accurately, yet physicians prefer to look at
watches.

The doctor may take a chest X-ray and examine it, make an
electrocardiogram, tap your knee for reflexes, and look at your throat, ears, and
nose. If there is a hearing problem, the doctor does not just whisper to
see if you hear; he or she asks for an audiology test, which measures
responses at different sound frequencies.






The physician may take one or more blood samples, or collect a urine
specimen, which will be sent to a laboratory and tested for sugar,
infection, albumen, or whatever.

Only after the doctor has collected the results of the various 
specific tests and interpreted them together with the interview does he or 
she attempt to reach a final diagnosis and prescribe treatment. If the 
results are inconclusive or contradictory, additional tests are ordered. 
Where would modern medicine be if doctors depended exclusively on the 
interview and direct observation of patients? 



The Oral Interview Test (OIT)



In the OIT, a trained linguist (the counterpart of the physician)
elicits samples of speech by asking questions, suggesting topics, and
probing into the usage of the examinee. The examinee, often in an
antagonist role, tries to exhibit the best usage and avoid pitfalls that
might lower the rating. The trained examiner keeps on probing until
satisfied that the true level of performance has been established or until
time becomes a problem.



Unlike the medical examination, the OIT does not lead to additional 
tests to obtain a more complete picture of competence in specific areas. 
Instead, the examiner searches for questions and topics that might elicit
desired responses and exhibit weaknesses and competencies. 

It may be argued that linguistic competence is less complex than the
functioning of the human body, yet linguistic competence is one of the
most complex achievements of a human being. In research on linguistic
geography, it takes interviewers many hours of exploration with the aid of 
questionnaires to report the speech characteristics of a single informant. 

By contrast, in an OIT, which lasts from five to thirty minutes, the
examiner immediately reaches the diagnosis or rating that says what the
examinee can and cannot do in and with the language, or, to use the Civil
Service ratings, that the examinee is native-bilingual, full professional,
minimum professional, limited working, or elementary in speaking.

From this observation of the examinee in conversation, the examiner
decides finally and irrevocably if the examinee can perform full
professional functions through the language. And, because
there is a face-to-face conversation, the examination is considered a
valid replication of professional function, which it is not.



When the physician suspects there might be a problem related to the
weight of a patient, he or she reaches for the exact weight measurement
and does not trust an approximate estimate. The oral interview examiner,
however, trusts the approximate estimate. When the physician suspects a
hearing problem, he or she does not stop with direct observation, but
studies the audiogram showing thresholds at various frequencies on the
sound spectrum. And the physician puts more trust in the audiogram, which
separates the elements of sound into frequencies, than in the integrative
informal test of speaking to determine if the patient hears normally or
not. Yet, in the OIT, the examiner does not use any specific measures
beyond direct observation of the behavior of the examinee because it is
supposedly more valid to do so than to seek more precise information
by means of additional tests of various elements. Where can language
examinations go if we insist on exclusive reliance on direct impression
examinations for our final diagnoses?

Evaluation 

So far we have argued only by analogy, and analogy does not prove
anything. But we would have to be blindfolded not to recognize that the
analogy raises some interesting questions about the possible limitations
of the technique. It seems to me that we are justified in assessing in
a more formal way the fundamental strengths and weaknesses of the OIT. In
testing terms, this means inquiring formally into the validity,
reliability, scorability, representativeness, and practicality of the
test, and determining what it does and does not do well and how it
can be modified or combined with other techniques to produce better
results.



Validity 

Validity is the most important single criterion to evaluate a test.
It is critical because without validity all other criteria, including
reliability, are worthless. Validity simply asks whether and to what
extent a test measures what it claims to measure. There is no absolute
and final answer to the question of validity, since a test only samples
what it purports to test. Instead, we search for evidence that supports
or weakens its claim, and then, on the basis of all the evidence, we make
a judgment.

There are many ways we can seek evidence to answer the validity
question. Some of the most convincing evidence comes from (1) face
validity, (2) content-of-sample validity, (3) native speaker performance,
and (4) empirical or statistical validity.

FACE VALIDITY. The greatest strength of the OIT is its surface or
face validity, i.e., the appearance on simple inspection that it tests
speaking, which is what it claims to test. The OIT has all the appearance
of testing speaking ability: it is actually a speaking performance on the
part of the examinee, and a speaking performance is not a substitute for
speaking but speaking itself.

If we were to rely on face validity alone, we would give the OIT the
highest validity rating as a speaking test. Such a rating would be amply
justified if speaking a language were as simple as riding a bicycle or
driving an automobile. By analogy, the OIT would be equivalent to the
road test of a driver's examination.

But mastering a language is more complex than driving a car, and on 
the basis of the questions raised by our analogy with the physician's 
examination, we should go beyond face validity into a deeper evaluation of 
the OIT. Even in a driver's examination, it is common practice to take a
written test prior to the road test. And the road test itself is not 
merely driving around the nearest block but a series of tasks that probe 
the competence of the driver in various maneuvers. 

With regard to the OIT, we notice immediately that it is a restricted 
sample of speaking that, as such, may or may not give a fully accurate 
picture of linguistic or communicative competence. This leads us to 
content-of-sample validity. 

CONTENT-OF-SAMPLE. Content in a language test refers to the language
and the situations tested. We know that language is a system of rules,
patterns, and lexical items and their meanings used by a speech community
to communicate and interact in carrying on the multiple functions typical
of life in that community. We should, therefore, inquire into the content
of the OIT with regard to grammatical system, vocabulary, pronunciation,
situations, and fluency.

Grammatical System. In the OIT the examinee may not have sufficient 
opportunity to ask questions, for example, or to use requests, invi- 
tations, or exclamations, or use various types of complex sentences 
or passive or reflexive constructions. The experienced examiner guards 
against such lacunae but may not be able to elicit utterances containing 
important elements of competence such as the different types of questions
including those of the yes/no, information, subject, verb phrase, 
predicate, and echo types, among others. 

We all agree that the total language system cannot be tested in one 
interview and that we must, therefore, be satisfied with a sample. But 
how is that sample to be chosen? By subjective impressions? By error
counts? By linguistic analysis? Without precise criteria concerning the
sample, there is bound to be variation among interviewers and from one
interview to another with regard to the elements elicited. An informal
general list such as examiners often have in mind allows too much 
variation. 



In a recorded OIT of Spanish, which lasted twenty minutes and yielded
an S rating of 4, the examiners asked fifty-five questions and the
examinee none (DeCesaris, 1977). The examiners made a clear effort to 
elicit the subjunctive and conditional forms, but they overlooked the area 
of interrogatives completely. 

Years ago, I was called as a consultant to evaluate an OIT under
development for the Air Force to test illiterate Puerto Rican recruits in
spoken English. It was a carefully structured interview that sought to
test competence in a number of areas. On examining it, I discovered that
it did not provide for questions to be put by the examinee.

Vocabulary. We know that even full bilinguals do not have completely
parallel competence in all lexical areas of the two languages. I, for
example, feel less competent to discuss psychology in Spanish than in
English, because practically all my study of psychology was in English,
but I feel more competent to discuss literature in Spanish. Should the
topic be soccer, I would again do better in Spanish; if it were current
movies, I would do badly in both. Yet, on the basis of a conversation on
some informally chosen topic, the OIT may report a rating of S-4, full
professional proficiency, which is described as "able to use the language
fluently and accurately on all levels normally pertinent to professional
needs," without necessarily sampling the lexical areas in which full
professional competence has been achieved.

Pronunciation. The OIT provides a highly valid sample of an
examinee's competence in pronunciation, with respect to both face validity
and content-of-sample validity. Practically all the phonemes and phoneme
sequences of the language and most of the intonation and rhythm patterns 
will be exhibited. There are problems with regard to scoring, but not 
with validity. 

Situational Content. One of the strengths of the OIT is that it
represents performance in a communicative situation. This is more valid
than reciting memorized texts as a measure of speaking, and it is more
valid than a repetition test described by Politzer et al. (1974). It is 
more valid than the noise test, which is essentially a dictation with 
noise interference, as reported in Spolsky et al. (1968) and Gaies et al. 
(1977). 

By attempting to introduce different questions and tasks, the
examiner tries to improve the situational content. In this sense the
OIT can be more effective than a picture stimulus test if the examiners
are experienced. Nevertheless, the OIT is not fully representative for
two reasons. (1) The OIT is a test of conversational competence rather
than of extended formal speaking. It does not sample the ability of a
professor to deliver a lecture to a class, or of an ambassador to give a
public lecture, as ambassadors are often invited to do. (2) It does
not sample sociolinguistic variations, which are sometimes critical in
effective communication. Notice, for example, variations required
in addressing men and women, older and younger persons, individuals of
high status, and in-house employees of different sociolinguistic status.
Of course, these differences could be deliberately sought out in the
interview and become part of it. The question would be then whether
the OIT were too long. Would its spontaneity be hampered? Could these
variations be tested by other means?

Fluency. Fluency is sampled quite adequately in the OIT. As with
pronunciation, any problems with regard to fluency will be in scoring
rather than in validity. Are all examiners rating the same thing when
they rate fluency? Should it be more explicitly defined?






NATIVE SPEAKER PERFORMANCE. The OIT seems strong with regard to
native speaker performance. All examinees would presumably perform at 
a rating level of 5 if tested in their native language. Yet there is one 
area that leaves some doubt in my mind. It is the matter of poise, 
personality, and presence. Would all examinees give a typical performance 
each time if tested in their native language? We are intuitively 
aware that we do not always perform at our best under all circumstances. 
Is there any substance to this impression? 

Differences in performance among educated adults may not turn out to 
be of major importance, but differences among children are substantial, as
reported by sociolinguistic studies of ghetto children. I recently made a
sound movie of a two-year-old Spanish-speaking child learning to read
Spanish. The parents had reported that he was able to read three books of
an experimental series. Yet, when we attempted to film his performance,
he did not read a single word, even though the filming was at home with
his parents. The OIT is not a test for two-year-olds, of course, but it
would be interesting to test some adult examinees in their native language
to see what performance they actually display.

EMPIRICAL VALIDITY. A standard empirical validation of a test is its 
correlation with a valid criterion. The valid criterion could be the 
scores on a speaking test whose validity has been previously established. 
With the OIT we cannot use this approach because we simply do not have a
fully validated and established speaking test. 

To obtain a more valid criterion, we will have to turn to (1) a more
extended version of the OIT with adequate sampling of situations and
language, (2) an increase in the number of graders or an increase in their
competence, or (3) a combination of the above. If it turns out that the
OIT correlates highly with the longer and better-structured version scored
by a group of qualified examiners, we would be justified in considering
the OIT validated.

I have not seen such a validation attempt. Instead, I have seen 
a proposal that a shorter version be correlated with the full OIT to
validate the shorter version. Obviously, if the shorter version
correlated highly with the normal length OIT, we would gain by the 
practical advantage of its shortness. 

However, since we are still exploring possible limitations of the
OIT, its validation with a longer, structured OIT scored by more than two
judges would seem to be of greater interest. Another possibility is the 
use of in-depth interviews supplemented by additional tests. 



Reliability 

Reliability has to do with the stability of obtained scores. If
scores fluctuate excessively for the same students on repeated
administrations, the test is unreliable. The extent to which scores are reliable
is expressed as a correlation between two sets of scores made by the same
students on the same test. In reliability, then, the test is correlated
not with a separate criterion, as in empirical validity, but with itself.
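
As a rough illustration of reliability expressed as a correlation of a test with itself, the following Python sketch computes a Pearson product-moment coefficient between two administrations of the same test. The score lists are invented for the example, and the function name is an assumption of this sketch, not part of any testing package.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical scores of the same six students on two administrations.
first_administration = [12, 15, 9, 20, 17, 11]
second_administration = [13, 14, 10, 19, 18, 10]

reliability = pearson_r(first_administration, second_administration)
print(round(reliability, 2))
```

A coefficient near 1 means the students keep roughly the same relative standing on both occasions; it says nothing by itself about whether the test measures what it claims to measure.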

The fewer the possible grades on the scale of a test, the easier it 
is to attain high reliability. The extreme case is a pass-fail test with 
a single cutoff point between passing and failing. Most students will be 
either far above or far below the cutoff point and thus assure high 
reliability since only those that are close to the cutoff point are likely 
to fluctuate. 

The OIT rating scale is based on nine effective slots, 0+, 1 and 1+, 
2 and 2+, 3 and 3+ , and 4 and 4+. It is not difficult to attain high 
reliability with such a scale. If scores were distributed over fifty or a
hundred points on the scale, we would expect the reliability of the OIT to 
be lower. 

The nine-point scale is apparently satisfactory for present govern- 
ment users of the test. For academic purposes, however, it is too coarse 
and tends to bunch up scores around the 1 and 1+ ratings, masking progress 
within and between them. The nine-point scale is a weakness also for
control-type research because it tends to flatten out significant 
differences in achievement in the range where most scores fall. 

Wilds (1975), while staunchly affirming, "The fact of the matter is 
that this system works," admits that

Even in languages in which tests are conducted frequently as 
French and Spanish, where there is no doubt that standards are
internalized and elicitation techniques are mastered, it is 
possible for criteria to be tightened or relaxed unwittingly 
over a period of several years so that ratings in the two 
languages are not equivalent or that current ratings are 
discrepant from those of earlier years. 

and 

It is, however, very much an in-house system which depends 
heavily on having all interviewers under one roof, able to
consult with each other and share training advances in
techniques or solutions to problems of testing as they are
developed and subject to periodic monitoring. It is most apt to
break down as a system when examiners are isolated by spending
long periods away from home base (say a two-year overseas
assignment), by testing in a language no one else knows, or by
testing so infrequently or so independently that they evolve
their own system. (p. 35)






The fact that two examiners are required to rate the OIT indicates
lack of confidence in the rating by one examiner. This compares
unfavorably with standard practice in testing, which as a rule relies on
one scorer. Because of weaknesses in reliability, the practice of using
two examiners should be maintained if practical from the point of view of 
trained personnel and cost. Dyson (1972) found that a shorter examination 
with team marking was better than a longer test with a single marker. 

Scorability 

The subjective nature of the OIT scoring is one of its weaknesses in 
its present form and use. According to Clark (1975), it takes four full 
days to train an examiner. And Wilds (1975) indicates, as quoted above,
that examiners who are out in the field for two years must be retrained.
The CIA has its two examiners rate the interview separately, and averages
the ratings on a scale. The FSI has the interviewers discuss their
differences to arrive at an agreement. These are indications that scoring 
the OIT is difficult and subjective to a significant degree. Improvement 
in this area is obviously desirable. 

A standard way to improve objectivity in scoring is to identify the 
measurable parameters of competence. The rating scales for accent, 
grammar, vocabulary, fluency, and comprehension reported by Wilds (1975) 
represent an effort in this direction. One may be puzzled, however, by
the weights of the different components: three points to grammar, two to 
vocabulary, one to fluency, two to comprehension, and zero to accent. 
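
To make the effect of these weights concrete, the following Python sketch combines five component ratings into a single total. It is a simplified illustration only: the weights are those reported above, but the linear weighted sum, the component rating values, and the function name are assumptions of this sketch and are not a description of the actual procedure reported by Wilds (1975).

```python
# Component weights as reported in the text above.
WEIGHTS = {
    "accent": 0,
    "grammar": 3,
    "vocabulary": 2,
    "fluency": 1,
    "comprehension": 2,
}


def weighted_total(ratings, weights=WEIGHTS):
    """Sum each component rating multiplied by its weight (assumed linear rule)."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing component ratings: {sorted(missing)}")
    return sum(weights[name] * ratings[name] for name in weights)


# Hypothetical component ratings for one examinee.
examinee = {"accent": 2, "grammar": 4, "vocabulary": 5, "fluency": 3, "comprehension": 4}

# The same examinee with a much better accent rating.
examinee_better_accent = dict(examinee, accent=6)

print(weighted_total(examinee))
print(weighted_total(examinee_better_accent))
```

Under a zero weight, the two totals are identical, which is the puzzle raised here: a large difference in pronunciation leaves the composite unchanged.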

This cannot mean that pronunciation is not an important factor in 
speaking. Pronunciation contributes to intelligibility even though 
redundancy resolves many inaccuracies in pronunciation. Furthermore,
sociolinguistic studies show that foreign language accentedness and social 
dialect markedness are perceived and judged by native speakers very 
quickly. A speaking test must, therefore, be considered incomplete until 
pronunciation is taken into account, either on a complex scale showing 
foreign and social dialect dimensions or on an inventory of pronunciation 
features or phonemes and sequences. And if this makes the OIT too 
difficult to score by available examiners, it should be supplemented with 
a pronunciation test of some kind to give us a better picture of speaking 
skill. 



Practicality

Practicality must be considered in conjunction with the particular 
uses intended for the OIT. The FSI, CIA, Peace Corps, and other agencies 
and organizations that have the trained personnel on hand and can keep 
careful control of ratings find the OIT practical. The estimated cost of 
$35 per examination (Jones, 1975, p. 9) and the fifteen interviews that can 
be administered by a team of two examiners in a working day (Clark, 1975, 
p. 20) are also acceptable to those users. A twenty-minute interview by
two trained examiners limits the use of the OIT in university and high
school settings for practical reasons. It would take a team of examiners
a full working week and two additional days to test 100 students, a not
uncommon task in those settings.

If the OIT were shortened to, say, five minutes, its practicality
would be significantly enhanced. If, in addition, a single examiner were
used, subject to checking by a second examiner when challenged, a further 
improvement in practicality would be effected. 

The OIT as a Listening Comprehension Test 

The OIT shows obvious weaknesses as a listening comprehension
instrument. In the interview that I analyzed from a recording, the
examiners asked fifty-five questions and the examinee required
clarification only once. In speaking, however, the examinee did not
ask any questions. The speaking sample was exclusively expository and
narrative. In listening comprehension it was all questions and no
narration or exposition. This represents a weakness in content-of-sample
validity. Furthermore, it is doubtful that any careful check could have
been kept on comprehension, since attention was on speaking.

Kaufman (1969) compared the S-ratings of forty-four Peace Corps
volunteers on the OIT with their listening comprehension scores on the
Pictorial Auditory Comprehension Test (PACT) developed for the Peace Corps
by John B. Carroll. PACT is a seventy-five item multiple-choice test
that uses four pictures as alternatives for each item. The tests were
administered after a nine-week intensive course in Spanish conducted in
Puerto Rico. The interviews were administered by Kaufman shortly after he
was recertified by the Foreign Service Institute to administer the OIT in
Spanish to Peace Corps volunteers. Kaufman was assisted throughout the
oral testing by a Puerto Rican and a Colombian, who had not been involved
in the training of these volunteers.

The S-ratings on the OIT and the listening comprehension (LC) scores 
on PACT are presented in Table 1. The correlation between the two sets of 
scores, using the Pearson product-moment linear correlation formula, 
was .83. This is fairly high and could be used to compare performances by
groups of similar students. Looking into a comparison of performance by 
individuals, however, a different picture emerges. 

Dividing the PACT scale into nine intervals to parallel the nine OIT
ratings, and equating the two scales at their modes (the slots with the
largest number of scores in each scale), we note that 68 percent of the
students who rated within the five levels 0, 1, 2, 3, and 4 (without
separating the 0+, 1+, etc.) also rated within the corresponding double
intervals on the PACT scores, while 32 percent were either above or
below. Using the full nine-point scale on both the OIT and PACT, 36
percent of the students remained in the same slot and 64 percent were
either above or below.
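
The kind of slot-by-slot comparison described here can be sketched in Python as follows. This is a simplified illustration, not a reconstruction of the analysis: it uses equal-width intervals over an assumed 0 to 75 score range rather than equating the scales at their modes, and all scores and ratings in it are invented.

```python
def to_slot(score, low, high, n_slots=9):
    """Map a raw score in [low, high] to one of n_slots equal-width intervals (0-based)."""
    if not low <= score <= high:
        raise ValueError("score outside the stated range")
    slot = int((score - low) * n_slots / (high - low))
    return min(slot, n_slots - 1)  # the top score belongs to the last interval


def exact_agreement(slots_a, slots_b):
    """Proportion of examinees placed in the same slot by both measures."""
    pairs = list(zip(slots_a, slots_b))
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical data: interview ratings coded 0 (0+) through 8 (4+),
# and raw listening scores on an assumed 0-75 scale.
oit_slots = [0, 1, 1, 2, 2, 3, 4, 5]
pact_scores = [5, 10, 20, 18, 30, 27, 40, 55]

pact_slots = [to_slot(score, low=0, high=75) for score in pact_scores]
print(pact_slots)
print(exact_agreement(oit_slots, pact_slots))
```

In this invented sample the two measures rise together, yet three of the eight examinees land in different slots, which illustrates how a high group correlation can coexist with frequent individual mismatches.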



TABLE 1

Spanish OIT S-Ratings and PACT LC Scores of 44 Peace Corps Volunteers

[Scatter table not recoverable from the source. Rows: PACT LC scores, grouped into nine intervals labeled (4+) down to (0+). Columns: OIT S-ratings, from 4+ downward. Cell entries mark individual volunteers.]

*Indicates what the LC rating would have been if measured by PACT.

In other words, if we use the OIT speaking ratings to predict PACT
listening comprehension performance using a nine-slot rating scale, we
are off by at least one level in approximately two-thirds of the cases,
indicating that the OIT S-ratings are not satisfactory measures of
listening comprehension. The reverse would also be true; that is, if we
use PACT listening comprehension scores to predict speaking performance in
terms of OIT ratings, we are off by at least one level in approximately
two-thirds of the cases, indicating that PACT listening comprehension
scores are not valid measures of speaking performance. This is further
confirmed by looking at some specific cases. We notice, for example, that
one student rated 2+ by the OIT would be rated 4+ by PACT. Another
student, with OIT 2, would rate PACT 4. And a third student, with OIT 1,
would rate PACT 2+.

Consequently, since a listening comprehension test can be admin- 
istered with ease to individuals as well as groups by examiners with 
standard training, and since results are scored objectively and quickly, 
separate listening comprehension tests are to be preferred in all cases in
which examinees are willing to submit to them.

What the OIT Does and Does Not Do Well and What to Do about It

Selecting and condensing some of the above considerations, it is not
unreasonable to state the following conclusions and recommendations.

1. The OIT is the best available test to obtain a valid speaking sample.
It should, therefore, be retained when the necessary requirements with
regard to personnel training and availability and budget provisions
are present.

2. The representativeness of the speaking sample is less satisfactory
than that of professionally prepared tests of listening comprehension,
reading, and writing. Therefore, the OIT should be further
structured to ensure better sampling of linguistic, situational,
and sociolinguistic components, or it should be supplemented by
other tests that are more effective in those areas. The OIT could
then be shortened to a more practical and uniform length.

3. Scoring of the OIT is unusually difficult and must be presumed
uneven under ordinary testing conditions. This problem can be
minimized by not relying exclusively on the OIT but supplementing
it instead with other objective tests.

4. The OIT is not a good test of listening comprehension by psychometric
standards. It should, therefore, not be used as a measure of that
skill. Listening comprehension tests are far superior and can be
administered individually as well as in groups at a fraction of the
cost of the OIT and with lower demands on personnel training.






5. The OIT is not a practical test of competence or internalization
of grammar, vocabulary, and pronunciation, because of sampling
and scoring problems. Therefore, it should be supplemented whenever
possible with tests of those components when they are deemed
necessary.

6. The OIT is not a test of reading or writing and should not be used
as a measure of those skills. This is stated to counter any claim
that language competence is general in nature and need not be tested
in its different manifestations.

7. Since the OIT is difficult to administer and score, and because it
requires highly trained personnel not always available, it should
be restricted to VIPs who might not be willing to submit to other
types of tests. For wider use, a short version of the OIT with
more limited goals, supplemented by additional tests, is
recommended.



Conclusion

To the query whether we are asking too much of the OIT in its present
form, the answer is yes. Therefore, we should either ask less of the
interview and supplement it with tests that are better adapted to some of
the components, or, rejecting that, we should extend the interview and
structure it so it will provide a better sample of linguistic,
situational, and sociolinguistic competence.

More specifically, in this observer's opinion, we should keep the OIT 
since it is a valid test of speaking and supports teaching and evaluation 
of speaking, but we should make it shorter, more uniform in length, and 
supplement it with tests of listening comprehension, reading, grammar, 
vocabulary, pronunciation, and writing for a more complete picture of 
competence. We should also increase the number of subcategories under 
each rating so as to reflect more adequately the vast achievement that 
mastery of a second language represents.






References 



Beardsmore, H. Baetens. "Testing Oral Fluency." IRAL 12 (1974): 317-25.

Clark, John L. D. "Theoretical and Technical Considerations in Oral
Proficiency Testing." In Testing Language Proficiency, edited by
Randall L. Jones and Bernard Spolsky, pp. 10-24. Arlington, Va.:
Center for Applied Linguistics, 1975.

Coward, D. A. "Confessions of an Oral Examiner." Modern Languages 68
(1977): 35-38.

Davison, J. M., and Geake, P. M. "An Assessment of Oral Testing Methods
in Modern Languages." Modern Languages 51 (1970): 116-23.

DeCesaris, Janet. "The FSI Interview." Unpublished term paper with
cassette recording, Georgetown University, 1977.

Dyson, A. P. "Oral Examining in French." Modern Language Journal 53
(June 1972): 54-55.

Gaies, S. J.; Gradman, H. L.; and Spolsky, B. "Toward the Measurement of
Functional Proficiency: Contextualization of the Noise Test." TESOL
Quarterly 11 (1977): 51-57.

Johansson, S. "An Evaluation of the Noise Test -- A Method for Testing
Overall Second Language Proficiency by Perception Under Masking
Noise." IRAL 11 (1973): 107-33.

Jones, Randall L., and Spolsky, Bernard, eds. Testing Language
Proficiency. Arlington, Va.: Center for Applied Linguistics,
1975.

Kaufman, David. "Comparison of Speaking Proficiency with Auditory
Comprehension -- An Experiment." Unpublished term paper, Georgetown
University, 1969.

Politzer, Robert; Hoover, Mary Rhodes; and Brown, Dwight. "Test of
Proficiency in Black Standard and Nonstandard Speech." TESOL
Quarterly 8 (1974): 27-35.

Rey, Alberto. "A Study of the Attitudinal Effect of a Spanish Accent
on Blacks and Whites in South Florida." Unpublished doctoral
dissertation, Georgetown University School of Languages and
Linguistics, 1974.

Shuy, Roger W. "Sociolinguistics." In Linguistic Theory: What Can It
Say about Reading?, edited by Roger Shuy, pp. 80-94. Newark, Del.:
International Reading Association, 1977.






Spolsky, Bernard; Sigurd, Bengt; Sako, Masahito; Walker, Edward; and
Arterburn, Catherine. "Preliminary Studies in the Development of
Techniques for Testing Overall Second Language Proficiency."
Language Learning 18 (August 1968): 79-101.

Wilds, Claudia P. "The Oral Interview Test." In Testing Language
Proficiency, edited by Randall L. Jones and Bernard Spolsky,
pp. 29-38. Arlington, Va.: Center for Applied Linguistics, 1975.


MEASURING FOREIGN LANGUAGE SPEAKING PROFICIENCY:
A STUDY OF AGREEMENT AMONG RATERS¹

Marianne L. Adams
Foreign Service Institute



Background

Proficiency in speaking a foreign language is more often inferred 
than directly measured. Perhaps this is because of the difficulty of 
scoring speaking examinations objectively. Yet, in an organization whose 
purpose it is to communicate with foreign nationals, foreign language 
proficiency must be measured, because inferring a person's speaking 
proficiency from the person's ability to read, write, or listen may not be 
valid. Although the assessment of speaking proficiency is difficult, the
responsibility is unavoidable. 

The School of Language Studies at the Foreign Service Institute
(FSI) trains and tests government employees for overseas service. The
purpose of the testing program is to provide information about the
professional usefulness of a given person's knowledge of a language. "How much
of the business of the United States government in country X would the
employee be competent to do in language X?" is the question FSI attempts
to answer. One key feature of the testing program is that employees are
assigned to "proficiency levels" based on their oral test performance.
Employee proficiency level assignments are based on the match between an
employee's oral test performance and prespecified levels of performance
required for each proficiency level. Therefore, the Foreign Service
Institute language proficiency test is referred to as a
"criterion-referenced test."

The speaking portion of the FSI language proficiency test consists of 
an oral interview structured with reference to the proficiency levels. 
The candidate is always asked to converse with a native speaker of the 
target language on topics as complex as he or she can manage. Three 
people take part in the test: the candidate, an interviewer, and 
an examiner. The last is in charge of the test and mostly listens,
but occasionally he or she directs the conversation.

Criterion-referenced tests are often contrasted with the better known 
norm-referenced tests. A norm-referenced test is constructed and used 
principally to facilitate making comparisons among individuals on the 
ability measured by the test. Clearly, a norm-referenced test would 
not meet FSI's needs. Because the purpose of a criterion-referenced 
test--to provide a clear description of what a candidate can do--is
fundamentally different from that of a norm-referenced test, it is
not surprising that methods for test development and evaluation differ
considerably for the two types of tests (Hambleton and Novick, 1973;
Millman, 1974; Swaminathan, Hambleton, and Algina, 1974).



¹The author would like to acknowledge the helpful comments and
constructive criticisms of Ronald K. Hambleton of the University of
Massachusetts, Amherst.



The test is widely used and enjoys a good reputation. It has been 
adopted by organizations faced with the need for speakers of foreign 
languages, e.g., the Peace Corps and some businesses. The test has both 
content validity and face validity and a clientele that has substantial 
confidence in the reliability of the proficiency ratings. Nevertheless, 
there is an ongoing need for technical analyses of the test and its 
characteristics.

The study reported here was designed to address the problem of 
agreement among different raters of proficiency level assignments to the 
same set of candidates. Specifically, the study was designed to address 
the following questions: 

1. Could the selection of a rater influence proficiency level
assignments (and if so, by how much)?

2. What would be the nature of disagreements in ratings? (For 
example, do disagreements in ratings between two examiners follow a random 
pattern?) Also, since some disagreements are more serious than others 
(mastery-nonmastery determination), what percentage of the time do raters 
agree in their mastery or nonmastery determination of candidates? 

3. How do the results from questions 1 and 2 above compare for tests 
in three languages: French, German, and Spanish? 

These questions refer, of course, to only one aspect of the test:
the individual rater. In the actual work situation, however, no rater
judges a test alone. Raters always work in pairs. The pairs of raters
also work under the well-defined testing procedures and criteria of the
test.

The results are therefore underestimates of the reliability achieved in
practice, where many of the inconsistencies are removed by consultation.
In this study we let inconsistencies stand.



Definitions

At this point it will be useful to define several terms:

1. Oral Interview -- A test of speaking proficiency in a foreign
language.

2. Foreign Language -- There were three foreign languages of interest
in this study: French, German, and Spanish.

3. Proficiency Scale -- The scale consists of eleven points: 0, 0+,
1, 1+, ..., 4+, 5. The labels attached to six of these points are as
follows:



0 - No Proficiency 

1 - Elementary Proficiency 

2 - Limited Working Proficiency 

3 - Minimum Professional Proficiency 

4 - Full Professional Proficiency 

5 - Native or Bilingual Proficiency 

If proficiency substantially exceeds the minimum requirements for the
level involved but falls short of performance required at the next higher
level, a "plus" is attached to a candidate's proficiency level.

4. Mastery Status -- Besides the eleven proficiency levels, an important
distinction is made between persons scoring 3 and above and those
scoring below 3. For purposes of this paper, I call persons receiving
scores 3 and above "masters" because there are certain professional
rewards in the U.S. Foreign Service for proficiency at the 3 level and
above. I call others "nonmasters." (Disagreements between examiners
that affect the "mastery status" of persons are far more serious than
disagreements that do not.)

5. Testing Team -- Consists of two raters, one known as examiner and
one known as interviewer.
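
The plus levels and the mastery distinction defined above can be illustrated with a short sketch. It assumes the numeric coding that appears later in the tables and in Appendix B, where a "plus" is written as an added .5; the function names `rating_to_number` and `is_master` are hypothetical and introduced here only for illustration.

```python
def rating_to_number(label: str) -> float:
    """Convert a scale label such as "2" or "2+" to a number (assumed coding: plus = +0.5)."""
    label = label.strip()
    if label.endswith("+"):
        return int(label[:-1]) + 0.5
    return float(int(label))


def is_master(label: str, cutoff: float = 3.0) -> bool:
    """A candidate is a "master" when the rating is 3 or above."""
    return rating_to_number(label) >= cutoff


# The eleven points of the scale, in order.
SCALE = ["0", "0+", "1", "1+", "2", "2+", "3", "3+", "4", "4+", "5"]
masters = [label for label in SCALE if is_master(label)]
```

Under this coding a 2+ and a 3 differ by only half a point, yet they fall on opposite sides of the mastery line, which is why such disagreements are treated as the more serious kind.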

The interviewer is usually a native speaker of the language being
tested and has received training in conducting FSI test interviews.
The examiner is linguistically oriented in one or more foreign languages,
including the one being tested. He or she is in charge of the
administration of the test. This responsibility includes instructing the
interviewer on the line of questioning, setting hypothetical role-playing
situations, supplying stimuli for conversation, and discussing the test
results with the candidate.

The examiner and the interviewer have equal voices in rating a test.
They vote on the results of a test. If their opinions differ by half a
point, the lower grade is awarded. If their opinions differ by a full
point, they submit their test, tape, and notes to arbitration by the head
of the testing unit.
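
The voting rule just described can be sketched as a small function. This is only an illustration under stated assumptions: ratings are coded numerically with a "plus" as .5, the name `resolve_team_rating` is hypothetical, and differences larger than a full point (which the text does not mention) are assumed to go to arbitration as well.

```python
from typing import Optional


def resolve_team_rating(examiner: float, interviewer: float) -> Optional[float]:
    """Return the awarded grade, or None when the case goes to arbitration."""
    difference = abs(examiner - interviewer)
    if difference == 0:
        return examiner                    # both raters agree
    if difference == 0.5:
        return min(examiner, interviewer)  # half-point split: lower grade is awarded
    return None                            # full point (or more, by assumption): arbitration


agreed = resolve_team_rating(3.0, 3.0)
half_point = resolve_team_rating(3.0, 3.5)
full_point = resolve_team_rating(2.0, 3.0)
```

Returning `None` rather than a number keeps arbitration cases visibly separate from awarded grades.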

Interviewers did not always have an equal voice in the grading
decision; rating was added to their duties just prior to this study.
The results of this study for them must be considered in light of the
novelty of the task.



Procedure 

Examiners and interviewers in French, German, and Spanish listened
individually to tapes of fifty tests (oral interviews) and rated them
independently.² The complete list of participants is included in
Appendix A. In total, we had six in French, four in German, and eleven in
Spanish. Four to six tapes at each of the eleven proficiency levels were
selected for use in the study, with the exception of level 5, where only
two or three examples per language were selected. By allowing the number
of tapes to vary, we prevented the participants from determining a
proficiency level based on an expected number of cases.

Some tapes had to be withdrawn from the study for lack of acoustic
fidelity. The final count of tapes used in the study was as follows:
French, fifty; German, forty-six; and Spanish, forty-eight.

Several raters did not judge every test. Others gave more than one
rating to some tapes. Fortunately, the number of times these events
occurred was very small. Rather than disqualify the raters, the
investigator supplied the average of grades given by other raters to fill
the gaps.
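
The gap-filling step described above might look like the following sketch. The function name `fill_missing_ratings` and the sample data are hypothetical; the text does not say whether the averaged grade was rounded to a scale point, so the mean is left unrounded here, and each tape is assumed to have at least one actual rating.

```python
from typing import Dict, List, Optional


def fill_missing_ratings(
    ratings: Dict[str, List[Optional[float]]]
) -> Dict[str, List[float]]:
    """Replace each missing rating (None) with the mean of the other raters' grades."""
    n_tapes = len(next(iter(ratings.values())))
    filled = {rater: list(values) for rater, values in ratings.items()}
    for tape in range(n_tapes):
        # Grades actually given for this tape by the raters who judged it.
        given = [values[tape] for values in ratings.values() if values[tape] is not None]
        mean = sum(given) / len(given)
        for rater in filled:
            if filled[rater][tape] is None:
                filled[rater][tape] = mean
    return filled


# Hypothetical data: rater A did not judge the second tape.
sample = {"A": [3.0, None], "B": [3.5, 2.0], "C": [2.5, 3.0]}
completed = fill_missing_ratings(sample)
```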

In total, five raters did not rate a complete set of tapes. The
situation was as follows:

Rater            Number of Tapes Rated

French, D                 49
German, C                 45
German, D                 45
Spanish, I                47
Spanish, J                45



The ratings were completed in two time periods:

1974 - Examiners
    French - raters A and B
    German - raters A and B
    Spanish - raters A, B, C, D, and E

1977 - Interviewers
    French - raters C, D, E, and F
    German - raters C and D
    Spanish - raters F, G, H, I, J, and K

Results and Discussion

In our first analysis, we correlated the ratings of each pair of
raters across the approximately fifty tapes. The correlations between
the pairs of raters for the French, German, and Spanish raters are
reported in Tables 1, 2, and 3. (The ratings data from which the
correlations were computed are reported in Appendix B.) It is clear from
the tables that there was a high level of agreement among the raters.
Correlations between their ratings in all cases exceeded .82, with the
average correlation .91.

²Twenty-five of the interviews were recorded at FSI and twenty-five at
the CIA Language School as part of a joint project between the two
schools.

The correlations reported in Tables 1, 2, and 3 are even more
impressive when one considers that the tapes presented each rater with a
possible range of eleven choices of ratings for each test (the more
possible choices of ratings, the more room for disagreement among the
raters). The high correlation coefficients show that there was
substantial agreement among the raters as to the criteria.
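
For readers who want to reproduce coefficients of this kind, the Pearson product-moment correlation between two raters' ratings of the same tapes can be computed as in the sketch below. The two short rating lists are hypothetical illustrations, not data from the study, and the function name `pearson_r` is likewise only illustrative.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equally long rating lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings of six tapes by two raters (plus levels coded as .5).
rater_a = [1.0, 3.5, 0.5, 4.0, 2.0, 2.5]
rater_b = [1.0, 4.0, 0.5, 4.5, 2.5, 2.0]
r = pearson_r(rater_a, rater_b)
```

A high coefficient shows only that two raters rank the tapes similarly; a rater who is uniformly half a point more generous would still correlate almost perfectly, which is why the analysis goes on to look at the size and direction of disagreements.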

Correlation tables are an interesting by-product but not the central
thrust of this study. For our purposes, we were more interested in the
kinds and degrees of disagreements--whether raters tended to assign
approximately the same ratings, or whether some were overly generous and
others overly strict.

TABLE 1

Pair-Wise Correlations* of French
Testers' Ratings of the Tapes

                      Rater
Rater      B      C      D      E      F

  A       .95    .92    .92    .93    .93
  B              .92    .92    .90    .92
  C                     .94    .89    .93
  D                            .92    .95
  E                                   .96

TABLE 2

Pair-Wise Correlations* of
German Testers' Ratings of the Tapes

                Rater
Rater      B      C      D

  A       .09    .93    .93
  B              .87    .80
  C                     .90

*Pearson product-moment correlation coefficients.



TABLE 3

Pair-Wise Correlations* of
Spanish Testers' Ratings of the Tapes

                                   Rater
Rater    B     C     D     E     F     G     H     I     J     K

  A     .95   .95   .96   .92   .94   .88   .91   .89   .94   .89
  B           .96   .96   .95   .92   .92   .94   .89   .94   .91
  C                 .96   .91   .91   .90   .89   .92   .94   .91
  D                       .93   .93   .91   .91   .94   .95   .91
  E                             .90   .87   .95   .85   .91   .90
  F                                   .87   .88   .87   .91   .94
  G                                         .84   .88   .92   .82
  H                                               .83   .92   .91
  I                                                     .90   .87
  J                                                           .90

*Pearson product-moment correlation coefficients.



Tables 4, 5, and 6, corresponding to the French, German, and Spanish
raters' data, respectively, summarize several pieces of pertinent data for
the purpose of this study. For the French raters, the average percentage
of ratings in agreement or tolerable disagreement was 92 percent. The
average percentage of times raters agreed on a candidate's mastery status
was 92 percent. Average percentage of agreement for the Spanish raters
was 87 percent and agreement on mastery status was 94 percent.
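
The four percentages summarized in Tables 4, 5, and 6 can be computed for any pair of raters along the lines of the sketch below. It assumes numeric ratings with a "plus" coded as .5 and treats any half-point difference as a "tolerable disagreement", which is one reading of the table footnotes; the function name `agreement_summary` and the sample ratings are hypothetical.

```python
from typing import Dict, Sequence


def agreement_summary(
    ratings_a: Sequence[float],
    ratings_b: Sequence[float],
    mastery_cutoff: float = 3.0,
) -> Dict[str, int]:
    """Percentages of perfect agreement, tolerable disagreement, their total,
    and identical mastery status for two raters of the same tapes."""
    n = len(ratings_a)
    perfect = tolerable = same_mastery = 0
    for a, b in zip(ratings_a, ratings_b):
        if a == b:
            perfect += 1
        elif abs(a - b) == 0.5:  # assumed definition of a tolerable disagreement
            tolerable += 1
        # Both raters place the candidate on the same side of the 3 level.
        if (a >= mastery_cutoff) == (b >= mastery_cutoff):
            same_mastery += 1

    def pct(count: int) -> int:
        return round(100 * count / n)

    return {
        "perfect": pct(perfect),
        "tolerable": pct(tolerable),
        "total": pct(perfect + tolerable),
        "mastery": pct(same_mastery),
    }


# Hypothetical ratings of four tapes.
summary = agreement_summary([3.0, 3.5, 2.5, 1.0], [3.0, 4.0, 3.0, 2.0])
```

In the sample, the third tape (2.5 against 3.0) counts as a tolerable disagreement on the scale but as a disagreement on mastery status, which illustrates why the two figures are reported separately.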

Table 7 shows that the errors in proficiency level determination that
did occur were, for the most part, not patterned. Only one rater was
consistently more generous, and one was consistently more severe.

What does it all mean? We would obviously like to have perfect
agreement, but every improvement has its price.

There are several known ways to increase reliability: reduce the
number of points in the scale, reduce the number of raters, lengthen
testing time. If we reduce the scale, we sacrifice information. If we
reduce the number of raters, we might overburden those who do test and
thus introduce a further error component. If we increase testing time, we
increase the cost.



TABLE 4

An Analysis of Proficiency Level
Assignments for Each Pair of French Raters

                       Percentage of Ratings in:
Rater     Number     Perfect      Tolerable       Total        Identical
Pair      of Tapes   Agreementᵃ   Disagreementᵇ   Agreementᶜ   Mastery Statusᵈ

 *A,B       50         78           16              94           94
  A,C       50         76           20              96           96
  A,D       49         52           35              87           88
  A,E       50         74           16              90           92
  A,F       50         74           14              88           88
  B,C       50         64           22              86           94
  B,D       49         60           24              84           90
  B,E       50         64           24              88           86
  B,F       50         70           20              90           90
**C,D       49         51           37              88           88
**C,E       50         78           18              96           92
**C,F       50         62           28              90           88
**D,E       49         55           37              92           88
**D,F       49         57           35              92           88
**E,F       50         64           24              88           84

Averages

Examiners              78           16              94           94
Interviewers           61           30              91           88
Actual Teams           67           22              89           91
All Raters Combined    69           23              92           92

ᵃPerfect agreement = Percent of identical ratings of a tape by two
raters, e.g., rater A's "3" = rater B's "3" or rater A's "3.5" = rater
B's "3.5."

ᵇTolerable disagreement = Percent of ratings of a tape by two raters
differing by .5 point across whole numbers, e.g., rater A's "3.5" - rater
B's "4.0."

ᶜTotal agreement = "Perfect agreement" plus "tolerable disagreement."

ᵈIdentical mastery status = Percent of times that two raters agree in
their mastery status determination.

*Examiners.
**Interviewers.



TABLE 5

An Analysis of Proficiency Level Assignments
for Each Pair of German Raters

                       Percentage of Ratings in:
Rater     Number     Perfect      Tolerable       Total        Identical
Pair      of Tapes   Agreementᵃ   Disagreementᵇ   Agreementᶜ   Mastery Statusᵈ

 *A,B       45         49           36              85           87
  A,C       45         62           24              86           87
  A,D       45         71           16              87           87
  B,C       45         51           22              73           96
  B,D       45         56           24              80           84
**C,D       45         93            7             100          100

Averages

Examiners              49           36              85           87
Interviewers           93            7             100          100
Actual Teams           60           22              82           89
All Raters Combined    67           22              89           92

ᵃPerfect agreement = Percent of identical ratings of a tape by two
raters, e.g., rater A's "3" = rater B's "3" or rater A's "3.5" = rater
B's "3.5."

ᵇTolerable disagreement = Percent of ratings of a tape by two raters
differing by .5 point across whole numbers, e.g., rater A's "3.5" -
rater B's "4.0."

ᶜTotal agreement = "Perfect agreement" plus "tolerable disagreement."

ᵈIdentical mastery status = Percent of times that two raters agree in
their mastery status determination.

*Examiners.
**Interviewers.



TABLE 6

An Analysis of Proficiency Level
Assignments for Each Pair of Spanish Raters

                       Percentage of Ratings in:
Rater     Number     Perfect      Tolerable       Total        Identical
Pair      of Tapes   Agreementᵃ   Disagreementᵇ   Agreementᶜ   Mastery Statusᵈ

 *A,B       48         73           23              96           98
 *A,C       48         82           12              94           98
 *A,D       48         73           23              96           94
 *A,E       48         73           19              92           94
  A,F       48         71           15              86           94
  A,G       48         67           21              88           96
  A,H       48         75           10              85           92
  A,I       47         66           19              85           88
  A,J       45         64           27              91           92
  A,K       48         58           23              81           88
 *B,C       48         79           17              96           98
 *B,D       48         65           25              86           94
 *B,E       48         71           25              96           98
  B,F       48         62           21              83           96
  B,G       48         69           15              84           94
  B,H       48         65           17              82           98
  B,I       47         66           25              91           87
  B,J       45         67           24              91           98
  B,K       48         42           33              75           90
 *C,D       48         73           19              92           96
 *C,E       48         75           17              92           94
  C,F       48         62           25              87           94
  C,G       48         67           21              88           96
  C,H       48         71           15              --           96
  C,I       47         74           13              87           87
  C,J       45         76           16              92           96
  C,K       48         52           21              73           90
 *D,E       48         71           19              90           94
  D,F       48         65           19              84           90
  D,G       48         52           33              85           92
  D,H       48         77           17              94           96
  D,I       47         57           17              74           85
  D,J       45         58           31              89           91
  D,K       48         58           15              73           88
  E,F       48         62           19              81           98
  E,G       48         67           19              86           94
  E,H       48         67           23              90           94
  E,I       47         66           15              81           92
  E,J       45         73           20              93           94
  E,K       48         50           27              77           90

(Continued)



TABLE 6 (cont.)

                       Percentage of Ratings in:
Rater     Number     Perfect      Tolerable       Total        Identical
Pair      of Tapes   Agreementᵃ   Disagreementᵇ   Agreementᶜ   Mastery Statusᵈ

**F,G       48         --           --              70           94
**F,H       48         --           17              --           94
**F,I       47         55           --              --           92
**F,J       45         62           --              --           98
**F,K       48         79           12              91           88
**G,H       48         58           17              75           94
**G,I       47         55           30              85           88
**G,J       45         69           18              87           96
**G,K       48         48           23              71           90
**H,I       47         55           23              78           85
**H,J       45         62           24              86           96
**H,K       48         52           21              83           92
**I,J       44         57           23              90           86
**I,K       47         40           34              72           83
**J,K       45         62           22              84           87

Averages

Examiners              74           --              92           96
Interviewers           58           --              81           91
Working Pairs          64           --              85           93
All Raters Combined    65           --              87           94

ᵃPerfect agreement = Percent of identical ratings of a tape by two
raters, e.g., rater A's "3" = rater B's "3" or rater A's "3.5" = rater
B's "3.5."

ᵇTolerable disagreement = Percent of ratings of a tape by two raters
differing by .5 point across whole numbers, e.g., rater A's "3.5" -
rater B's "4.0."

ᶜTotal agreement = "Perfect agreement" plus "tolerable disagreement."

ᵈIdentical mastery status = Percent of times that two raters agree in
their mastery status determination.

*Examiners.
**Interviewers.



TABLE 7

Direction of Errors among Pairs of Raters

                     Total            Number of Times:
Rater     Number     Number of        First Rater   Second Rater   Chi-
Pair      of Tapes   Disagreements    Higher        Higher         Square

FRENCH RATERS

 *A,B       50         24               12            12             .00
  A,C       50         24               12            12             .00
  A,D       49         29                9            20            4.18
  A,E       50         28               10            18            2.28
  A,F       50         23               17             6            5.26
  B,C       50         30               13            17             .52
  B,D       49         30                8            22            6.54†
  B,E       50         34                9            25            7.52
  B,F       50         28               18            10            2.28
**C,D       49         29               10            19            2.80
**C,E       50         28               10            18            2.28
**C,F       50         28               20             8            3.84
**D,E       49         22               16             6            4.54†
**D,F       49         31               26             5           14.22
**E,F       50         33               26             7           10.92

GERMAN RATERS

 *A,B       45         33               19            14             .74
  A,C       45         29               12            17             .86
  A,D       45         31               20            11            2.30
  B,C       45         27               18             9            3.00
  B,D       45         29               19            10            2.78
**C,D       45          9                2             7            2.78

*Examiners.
**Interviewers.

† = Significant at p < .05 level.



TABLE 7 (cont.)

                     Total            Number of Times:
Rater     Number     Number of        First Rater   Second Rater   Chi-
Pair      of Tapes   Disagreements    Higher        Higher         Square

SPANISH RATERS

 *A,B       48         20               10            10             .00
 *A,C       48         27               11            16             .92
 *A,D       48         --               --            --             --
 *A,E       48         20               10            10             .00
  A,F       48         24                8            16            2.66
  A,G       48         27               12            15             .34
  A,H       48         21               14             7            1.81
  A,I       47         24               14            10             .66
  A,J       45         21                8            13            1.20
  A,K       48         31                6            25           11.64†
 *B,C       48         28               12            16             .56
 *B,D       48         24               13            11             .16
 *B,E       48         24               11            13             .16
  B,F       48         27               12            15             .34
  B,G       48         28               12            16             .56
  B,H       48         29               --            14             .04
  B,I       47         25               14            11             .36
  B,J       45         21                9            --             --
  B,K       48         36               --            --             --
 *C,D       48         20                8            12             .80
 *C,E       48         21                7            14             --
  C,F       48         34               12            22            2.94
  C,G       48         33               11            22            3.67
  C,H       48         25               13            12             .04
  C,I       47         19                9            10             .06
  C,J       45         21                3            18           10.71†
  C,K       48         35               --            --             --
 *D,E       48         19                9            10             .06
  D,F       48         28                9            19             --
  D,G       48         37               18            19             .02
  D,H       48         21               14             7             --
  D,I       47         25               13            12             .04
  D,J       45         25               10            15             .50
  D,K       48         32                7            25            5.06†
  E,F       48         23                8            15            2.12
  E,G       48         26               11            15             .62
  E,H       48         26               10            16            1.38
  E,I       47         23               14             9            1.08
  E,J       45         17                7            10             .54
  E,K       48         30                6            24           10.80†

*Examiners.
**Interviewers.

† = Significant at p < .05 level.



TABLE 7 (cont.)

                     Total            Number of Times:
Rater     Number     Number of        First Rater   Second Rater   Chi-
Pair      of Tapes   Disagreements    Higher        Higher         Square

**F,G       48         35               19            16             .26
**F,H       48         33               24             9            6.82
**F,I       47         32               21            11            3.12
**F,J       45         29               14            15             --
**F,K       48         26                5            21             --
**G,H       48         34               20            14            1.06
**G,I       47         32               19            13            1.12
**G,J       45         22               10            12             .18
**G,K       48         36               11            25            5.44
**H,I       47         29               13            16             .30
**H,J       45         26                8            18            3.85
**H,K       48         34                4            30           19.88†
**I,J       45         23                6            17            5.26
**I,K       47         36                8            28            5.56†
**J,K       45         26                6            20            7.54†

*Examiners.
**Interviewers.

† = Significant at p < .05 level.
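
The statistic in the last column of Table 7 is not named in the surviving text, but most of the legible rows are consistent with a one-degree-of-freedom chi-square test of whether one rater of a pair was higher more often than the other. The sketch below is written under that assumption; the function names are hypothetical, and 3.841 is the standard critical value for one degree of freedom at the .05 level.

```python
CRITICAL_VALUE_05_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, p = .05


def direction_chi_square(first_higher: int, second_higher: int) -> float:
    """Chi-square for the hypothesis that each rater is the higher one equally often."""
    total = first_higher + second_higher
    if total == 0:
        return 0.0
    expected = total / 2  # under no directional bias, disagreements split evenly
    return (
        (first_higher - expected) ** 2 + (second_higher - expected) ** 2
    ) / expected


def is_patterned(first_higher: int, second_higher: int) -> bool:
    """True when the direction of disagreement is significant at the .05 level."""
    return direction_chi_square(first_higher, second_higher) > CRITICAL_VALUE_05_DF1


even_split = direction_chi_square(12, 12)  # e.g. a 12-12 split of 24 disagreements
lopsided = direction_chi_square(26, 5)     # e.g. a 26-5 split of 31 disagreements
```

An even split gives a statistic of zero, while a split such as 26 to 5 gives a value well above the critical value, the pattern expected when one rater of the pair is consistently more generous.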



In actual practice the rate of agreement is higher because not all
possible pairs constitute testing teams. The average rate of
agreement for actual testing teams in French and Spanish was 89 percent
and 85 percent. The average rate of agreement on mastery status was 91
percent and 93 percent. Since the examiner is in charge of the test, the
rate of agreement among examiners is especially important. These rates in
French and Spanish were 94 percent and 92 percent for all tests. The
agreement on mastery status among French and Spanish examiners was 94
percent and 96 percent. (French interviewer C rated as reliably as the
French examiners and has since been moved to examiner status.)

The results of the experiment in German were somewhat different.
The more reliable German raters were the interviewers rather than the
examiners. The interviewers agreed with each other 100 percent in both
areas of interest in this study. They never varied from each other more
than a "plus." The figures for the two examiners are lower; they agreed
with each other generally at the rate of 89 percent and they agreed on
mastery status at the rate of 87 percent.



The explanation for the difference probably lies in the history
of the raters' association with each other. The German examiners never
worked together but rather succeeded each other in the job with no
overlap. The German interviewers, on the other hand, have provided
consistency in testing for more than ten years.

If we can draw any general conclusions from this study, they would be
these: at the very least, 84 percent of examinees would receive the same
rating from two independent raters. Or, more realistically, in similarly
stringent situations 94 percent of the examinees would receive the same
scores from different French raters. Ninety-three percent would receive
the same scores from different German raters. Agreement on mastery status
would be 94 percent, 96 percent, and 93 percent.

There is further reason to believe that the rate of agreement is
higher in practice. There is no problem with lack of acoustic fidelity
in a face-to-face interview. Grades are never decided by one rater alone
(as was done in this study) but rather by two raters in consultation.
Further, in a live test situation, each member of the testing team can
gather the evidence necessary for a sound judgment, whereas in the
experiment each rater had to make do with someone else's testing technique.

Some of the most interesting and revealing tapes from this point of
view were those that received a broad range of scores. Some of them
involved difficult decisions of factor weighting, such as near-native
fluency and pronunciation against serious grammatical errors; or a
vocabulary inventory and comprehension worthy of an educated speaker, but
without structural control; or good use of difficult grammatical features,
but a vocabulary liberally strewn with inappropriate anglicisms. Tests
like these do not easily fit one definition, yet a decision in terms
of ability to do a job must be made.






Appendix A 
List of Participants 



Vicente Arbeiaez 
William Van Buskirk 
Monique Cossard' 
Susana Framinan 
Catherine Hanna 
C. Cleland Harris
Pauie Horn
Isabel Lowery
Joann Meeks
Juan José Molina
Alain Mornu 



Margarethe Plischke 
Robert Salazar 
Harlie Smith 
Patricio Solis 
Blanca Spencer 
Marina Wilie Stinson 
Marie-Françoise Swanner
Jack Ulsh 
Agustin Vilchec 
Allen I. Weinstein 






Appendix B

French Ratings

 Test                      Rater
Number     A      B      C      D      E      F

   1      1.0    1.0    1.5    1.0    1.5    1.0
   2      3.5    4.0    3.5    4.0    3.5    3.5
   3      3.5    4.0    3.0    4.0    3.0    3.5
   4      0.5    0.5    1.0    0.5    1.0    0.5
   5      4.0    4.5    3.5    4.0    4.0    4.0
   6      2.0    2.5    3.0    3.5    3.0    2.5
   7      2.5    2.0    2.0    2.5    2.0    2.0
   8      1.0    1.0    1.0    1.0    1.0    1.0
   9      4.0    4.5    3.0    3.5    3.5    3.0
  10      1.5    2.0    2.0    2.0    2.0    1.5
  11      2.5    2.0    2.0    3.0    2.5    2.5
  12      3.5    3.5    3.0    3.0    3.5    4.0
  13      3.5    4.5    5.0    5.0    5.0    4.5
  14      0.5    0.5    0.5    1.5    0.5    0.5
  15      3.5    2.0    3.0    3.0    3.0    2.5
  16      4.0    4.0    4.0    4.5    4.0    4.5
  17      0.5    0.5    1.0    1.0    0.5    1.0
  18      3.0    3.5    3.0    3.5    3.5    3.0
  19      5.0    5.0    4.5    5.0    5.0    5.0
  20      2.0    2.5    2.5    2.5    3.0    1.5
  21      3.0    3.0    3.0    3.0    3.5    2.5
  22      2.5    2.5    2.5    2.5    2.5    2.0
  23      1.5    1.5    1.5    2.0    1.5    1.0
  24      0.5    0.5    1.0    0.5    1.0    0.5
  25      3.0    3.0    3.0    2.5    3.0    2.0
  26      4.0    3.5    4.0    3.5    4.0    4.0
  27      1.5    1.5    1.0    1.5    1.5    1.0
  28      3.0    3.5    3.0    3.0    3.5    3.0
  29      5.0    5.0    5.0    5.0    5.0    5.0
  30      1.5    1.5    1.5    2.5    1.5    1.5
  31      3.0    3.0    3.0    2.5    2.5    2.0
  32      2.0    1.0    2.0    1.5    1.5    1.0
  33      2.5    2.5    2.5    3.0    2.5    2.5
  34      1.5    1.5    1.5    1.5    1.5    1.0
  35      1.0    0.5    1.?    0.5    1.0    0.5
  36      3.5    3.5    3.0    3.0    3.5    3.0
  37      4.0    4.0    4.0    4.0    3.5    4.0
  38      5.0    5.0    5.0    4.5    5.0    5.0
  39      3.0    2.5    3.0    3.0    3.5    3.0
  40      1.5    1.5    2.0    2.5    2.5    1.5
  41      1.0    1.0    1.5    2.0    1.5    1.0
  42      0.5    0.0    1.0    0.8*   0.5    0.5
  43      1.5    1.0    2.0    1.5    1.5    1.5
  44      2.5    2.0    2.5    2.5    2.5    2.0
  45      4.0    3.5    4.0    4.0    3.5    4.0
  46      3.0    2.5    2.5    3.0    3.5    2.5
  47      2.5    2.5    2.5    3.0    2.5    3.0
  48      4.5    4.5    4.0    5.0    4.5    4.5
  49      1.5    2.0    1.5    2.5    1.5    1.0
  50      2.0    2.5    2.0    2.5    3.0    2.0

*Investigator supplied data (average score).












German Ratings 



Test 
Number 

1 
2 
4 
5 

. 6 
7 
8 
9 

10 

11 

13 

14 

15 

16 

17 

18 

19 

21 

22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 

36 

37 

38 

39 

40 

41 

42 

43 

44 

45 

46 

47 

49 

50 



Rater 



A 

1.0 
2.5 \ 
4.0 , 
5.0 ' 
3.0 
5.0 
2.5 
.5 
1.5 
4.0 
2.0 



5 
4 
2 
3 
1 
2 
3.5 
3.0 
'4.5 
3.0 
1.0 
4.0 
4.0 
1.5 
3.0 
1.5 
3.5 
3.0 

.5 
2.5 
4.5 
1.5 
2.5 
1.5 
4.0 
I.O 
2.0 
4.5 
3.5 
1.0 

.5 
4.5 
3.5 
1.0 
4.0 



B 

1.5 
3.0 
4.0 
4.5 
2.5 
4.0 
2.0 
.5 
2.0 
3.5 
2.0 
4.5 
4.5 
2.0 
2.0 
3.0 
2.0 
3.5 
3.0 
4.0 
3.0 
1.0 
.0 
.5 
,0 
.5 
?.0 
2.0 
2.5 
1.0 
3.0 
4.5 
2.0 
3.0 
1.5 
3.5 
1.5 
3.0 
3.5 
4.0 
2.0 
.5 
4.0 
3.0 
1.5 
3.5 



4. 
3. 
1. 
1. 



.5 
2.0 
2.0 

5.0 
2.5* 
4.5 
2.0 
.5 
2.0 
3.0 
2.0 
5.0 
4.0 
2.0 
2.5 
1.5 
2.0 
3.0 
3.0 
4.0 

* y 

1.0 
3.5 
3.5 

.5 
2.5 
1.5 
2.0 
3.0 

.5 
2.0 
3.0 

1.0 ■ 

3.0 
1.5 
4.0 
1.5 
2.0 
4.0 
3.0 
1.0 
.5 
5.0 
3.0 
1.0 
3.0 



D 

1.0 
2.0 
2.5 
5.0 
3.0* 
4.5 
2.0 
.5 
2.0 
3.0 
2.0 
5.0 
4.5 
2.0 
2.5 
1.5 
2.0 
3.0 
3.0 
4.0 
2.5 
1.0' 
3.5 

4.0 
1.0 

2.0 
1.5 

2.0 

3.0 
.5 

2.0 

3.0 

1.0 

3.0 

1.0 

4.0 

1.0 

2.0 

4.0 

3.0 

1.5 
.5 

5.0 

3.0 

1.5 

3.0 



*Investigator supplied data (average score).






Spanish Ratings

 Test                                      Rater
Number    A     B     C     D     E     F     G     H     I     J     K

   1     1.5   1.5   1.5   1.0   1.0   2.0   1.0   1.5   1.5   1.5   2.0
   2     2.0   1.5   1.5   1.5   1.5   2.0   2.0   2.0   1.5   1.5   2.0
   3     1.5   1.5   1.5   1.5   1.5   1.5   1.0   1.5   1.5   2.0   2.0
   4     4.5   5.0   4.5   4.5   4.5   4.0   5.0   4.0   4.5   4.5   4.5
   5     2.5   2.5   2.0   2.0   2.0   2.5   2.0   2.0   2.0   2.0   3.0
   6     4.5   5.0   4.0   3.5   5.0   3.5   5.0   3.5   5.0   5.0   4.5
   7     1.0   1.5   1.0   0.5   1.0   1.0   1.0   1.0   1.0   1.0   1.5
   8     4.0   3.0   3.0   4.0   3.0   2.5   3.0   4.0   2.5   3.5   3.5
   9     4.0   4.0   4.0   4.0   4.0   4.0   4.5   3.5   4.0   4.0   4.0
  10     2.0   2.0   2.0   2.5   2.0   2.5   2.0   1.5   2.0   2.0   2.0
  11     2.0   2.0   2.5   2.0   2.5   2.5   2.5   2.0   2.0   2.5   2.5
  12     0.5   0.5   1.0   0.5   0.5   1.0   0.5   0.5   2.0   1.0   1.5
  13     2.0   2.0   1.5   2.0   2.0   2.0   0.5   2.5   1.0   2.0   2.5
  14     1.0   1.0   1.0   1.0   1.0   2.0   1.5   1.0   1.5*  0.5   2.0
  15     2.0   2.0   2.0   2.5   2.5   2.0   2.0   2.5   1.5   2.5   2.5
  16     3.0   3.5   3.5   3.5   3.5   3.5   4.0   3.0   3.5   4.0   4.5
  17     2.0   2.0   2.0   2.0   2.0   2.5   1.5   2.0   1.5   2.0   3.0
  18     3.5   3.0   3.0   3.5   3.5   3.5   3.5   3.5   2.0   3.5   3.5
  19     0.5   1.0   0.5   0.5   0.5   0.5   1.0   0.5   0.5   0.5   0.0
  20     2.5   2.5   2.5   3.0   3.0   2.5   2.5   2.5   3.0   2.5   2.5
  21     0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5
  22     1.0   0.5   1.0   1.0   0.5   1.5   0.5   1.0   1.0   1.0   1.5
  23     5.0   5.0   5.0   5.0   5.0   4.0   4.5   4.0   5.0   4.5   4.5
  24     4.0   4.5   3.0   3.5   3.5   4.0   3.5   3.0   2.5   3.5   4.0
  25     1.0   1.0   1.0   1.0   0.5   1.0   2.0   1.0   1.5   1.0   1.5
  26     3.0   3.0   3.5   3.5   3.5   3.0   3.0   3.5   2.5   3.5   3.5
  27     3.5   3.5   3.5   3.5   4.5   4.5   3.0   3.5   3.5   4.0   4.5
  28     2.5   3.0   2.5   2.5   4.0   3.0   2.5   3.5   3.5   4.0   4.0
  29     2.5   2.5   2.0   3.0   2.5   2.5   2.5   3.0   2.5   2.5   3.0
  31     3.0   2.0   2.0   2.5   2.5   2.5   2.5   2.5   2.5   2.5   2.5
  32     4.0   4.5   4.0   3.5   3.0   3.5   4.5   3.0   4.5   4.5   3.0
  33     1.5   1.5   1.5   2.0   1.5   2.0   1.5   1.5   1.5   1.5   2.0
  34     1.0   1.0   1.0   1.0   1.0   2.0   1.5   1.0   1.5   1.5   2.0
  35     2.5   2.5   2.5   2.0   2.5   2.5   1.5   2.0   2.5   1.5   3.0
  36     4.5   5.0   4.5   4.5   4.5   4.0   4.0   4.5   4.5   4.5   4.5
  37     1.5   1.5   1.5   1.5   1.5   2.5   2.0   1.5   1.5   1.5   2.0
  38     4.0   3.5   3.5   4.0   4.0   4.0   4.5   4.0   3.5   4.0   4.0
  39     3.0   2.5   2.5   2.5   3.0   3.0   3.0   2.5   3.0   2.5   3.0
  40     2.0   1.5   2.0   2.0   2.0   2.0   2.5   2.5   2.0   2.0   2.0
  41     4.5   5.0   5.0   3.5   5.0   4.5   5.0   3.5   5.0   5.0   4.5
  42     0.5   0.5   0.5   1.0   0.5   1.0   0.5   0.5   0.5   0.5   1.5
  43     1.0   1.0   1.0   1.0   0.5   1.5   1.0   1.0   1.0   1.0*  1.5
  44     3.0   3.0   3.0   3.0   3.0   3.0   3.0   3.5   3.5   3.0   3.0
  46     2.0   2.0   2.5   2.5   2.0   2.5   2.0   2.0   2.5   2.3*  2.5
  47     1.5   1.0   1.0   2.0   1.5   1.5   1.5   1.5   1.0   1.5   2.0
  48     2.5   2.0   2.0   2.0   2.0   2.5   4.0   2.0   1.5   3.5   2.0
  49     2.0   2.0   2.0   2.0   2.5   2.0   2.5   2.0   2.0   2.5   2.0
  50     4.5   4.5   4.0   4.0   4.0   4.0   4.5   3.5   4.0   4.5   4.5

*Investigator supplied data (average score).






References 

Hambleton, R. K., and Novick, M. R. "Toward an Integration of Theory and
Method for Criterion-Referenced Tests." Journal of Educational
Measurement 10 (1973): 159-70.

Millman, J. "Criterion-Referenced Measurement." In Evaluation in
Education: Current Applications, edited by W. J. Popham. Berkeley,
Calif.: McCutchan, 1974.

Swaminathan, H.; Hambleton, R. K.; and Algina, J. "Reliability of
Criterion-Referenced Tests: Decision Theoretic Formulation."
Journal of Educational Measurement 11 (1974): 263-68.

Tollinger, S., and Paquette, F. A. The MLA Foreign Language Proficiency
Tests for Teachers and Advanced Students: A Professional Evaluation
and Recommendations for Test Development. New York: The Modern
Language Association, 1966.



INDEPENDENT RATING IN ORAL PROFICIENCY INTERVIEWS



John Quinones
Central Intelligence Agency

Background 

Since 1972 the Language School of the Central Intelligence Agency 
has been using independent rating and averaging of testers' ratings to 
determine oral proficiency levels. In this paper I will discuss the 
development and use of a graphic rating scale that is used in conjunction 
with the verbal descriptions of the FSI Absolute Language Proficiency 
Ratings (Rice, 1959; FSI, 1963; Clark, 1972; Wilds, 1975).

The interview technique currently employed at the Language School 
is conceptually similar to the one developed at and used by the Foreign 
Service Institute of the U.S. Department of State. The two agencies 
differ, however, in three aspects. The testing team at the FSI consists
of a native speaker of the language being tested and a scientific linguist 
familiar with the language. At the FSI, unlike the CIA, the S-rating is a 
combination of speaking and listening comprehension factors. The S-rating 
is always determined by averaging at the CIA (in languages in which there 
are at least two testers), while at the FSI there are several methods 
employed, including averaging when feasible. 

Prior to 1972 the determination of proficiency levels in interview 
tests at the Language School was handled differently by different panels. 
In most cases one tester would suggest a rating and if the other member 
of the team disagreed, they would discuss the test until the discrepancy 
was resolved. In some cases the testers would vote on paper and if the 
ratings could be averaged (for example, one tester voted "2" and the other 
"3") they would combine the ratings. The resolution by discussion was
sometimes time-consuming and occasionally led to unpleasant interpersonal
confrontations, especially when one tester was inflexible in the
interpretation of the level definitions.



Characteristics of the New Rating System 

We felt the new rating system should contain at least the following 
features : 

1. The system should allow for the differentiation of speaking
and understanding since the Language School gives separate
ratings for these skills.

2. The degrees of proficiency in each skill should be
represented on the regular eleven-point scale, from 0 to 5,
with "pluses" for levels 0 through 4.

3. The system should allow each tester the opportunity to
contribute fully to the determination of the final rating.

4. The system should facilitate immediate feedback to managers
on the effectiveness of the testing program.

5. The system should permit immediate feedback to testers as to
intertester reliability while at the same time decreasing or
eliminating the possibility of interpersonal conflict.

6. The system should incorporate a graphic representation of
the concepts of range and "plus."

7. The system should allow easy averaging of two or more 
ratings. 

While most of the above desirable features did not require much
clarification or discussion, we had to specify the notion of range in
order to incorporate it into the graphic scale. I think that most
practitioners of the FSI oral interview characterize the levels on the
eleven-point scale as ranges. It is thus very common to hear, in the
discussions that follow tests, statements such as "It's a low 5," or "It's
a classical 3," or "It's a very high 2+ but not quite a 3." Because
this notion is important in the assignment of levels, it is graphically
represented on the scale that we developed (Figure 1). Testers can,
therefore, make these finer judgments and have them count in the combined
rating.

The notion of the "plus" was also made part of the graphic scale.
In the current FSI system, all the numerical ratings except 5 may be
modified by a "plus" to indicate that the examinee substantially exceeds
the requirements for a level but fails to meet the requirements of the
next higher level, especially in either grammar or vocabulary. The "plus"
range is thus represented on the graphic scale as having a value of .60
to .99.
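The range and "plus" conventions can be illustrated with a short sketch. The function name `fsi_level` and the choice to pass scale values as whole tenths (28 for 2.8) are assumptions made for illustration, chosen so that the .60 boundary is tested with integer arithmetic; they are not part of the Language School's procedure.

```python
def fsi_level(tenths):
    """Convert one tester's scale value, given in tenths (28 means 2.8), to an FSI level."""
    if not 0 <= tenths <= 50:
        raise ValueError("scale value must lie between 0.0 and 5.0")
    base, fraction = divmod(tenths, 10)
    # A fractional part of .60 or more falls in the "plus" range; level 5 has no plus.
    if base < 5 and fraction >= 6:
        return f"{base}+"
    return str(base)


for value in (20, 25, 26, 29, 30, 50):
    print(value / 10, "->", fsi_level(value))
```

Under this reading, a "low 3" and a "classical 3" both convert to the level "3," but they contribute different numerical values when ratings are combined.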



Rating of the Test 

After the testers have finished the oral interview, they proceed to
rate the examinee without consultation, using the rating sheets provided
for this purpose (Figure 1). The independent judgment of each tester is
expressed by drawing a line across each rating scale (speaking and
understanding) at the point he or she feels best indicates the examinee's
overall proficiency in the skill. The testers are encouraged to make
full use of the ranges in the scale since it is essential for the purpose
of averaging the scores.

Determination of the Final Rating

After each tester has decided on his or her rating, the rating
sheets, properly identified, are turned over to a testing aide for
scoring. The combined rating for a given skill is determined by using a
ruler marked in tenths. In cases in which the tester's mark coincides
with a marking on the ruler, the lower tenth is always assigned. (The
rationale for this rule is that we believe that in general the
consequences of overrating are more serious than the consequences of
underrating.) Conversion tables are shown in Table 1.

FIGURE 1

Language Proficiency Ratings
(Oral-Aural Skills)

[Rating sheet with spaces for Examinee, Examiner, Language, Test Number,
and Date; two vertical scales, SPEAKING and UNDERSTANDING, each divided
into ranges from "0" to "5," including the "plus" ranges; and a REMARKS
section.]

This method of scoring permits not only the averaging of two scores
but the averaging of any number of scores. The desirability of combined
or averaged scores is supported by the fact that in oral interview tests
(assuming that the testers are rigorously trained, as is presently the
case in the Language School) the average is both a more reliable and a
more accurate (valid) rating than the sole judgment of the best rater.
This has been documented in studies on clinical judgment and decision
making.
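The conversion for any number of judges can be sketched as follows, on the assumption (consistent with the ranges in Table 1) that the summed ratings are compared against the single-judge boundaries multiplied by the number of judges. The function name `combined_level` and the use of whole tenths as input are illustrative choices, not the Language School's own scoring procedure.

```python
def combined_level(ratings_tenths):
    """Combine independent ratings (each in tenths, 28 means 2.8) into one FSI level.

    Comparing the sum with n times each boundary is equivalent to averaging
    and keeps the arithmetic in integers.
    """
    n = len(ratings_tenths)
    if n == 0:
        raise ValueError("at least one rating is required")
    if any(not 0 <= r <= 50 for r in ratings_tenths):
        raise ValueError("each rating must lie between 0.0 and 5.0")
    total = sum(ratings_tenths)
    base, remainder = divmod(total, 10 * n)
    # The "plus" range starts at .60 per judge, i.e. at 6 * n tenths in the sum.
    if base < 5 and remainder >= 6 * n:
        return f"{base}+"
    return str(base)


print(combined_level([24, 27]))      # two judges, summed rating 5.1
print(combined_level([46, 46]))      # two judges, summed rating 9.2
print(combined_level([46, 45, 46]))  # three judges, summed rating 13.7
```

Working from the sum rather than the mean mirrors the layout of the conversion tables, which list summed ratings for two or more judges.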

TABLE 1

Conversion Tables for
Language Proficiency Ratings

                                   Range of             Range of
            Range of               Scale Values for     Scale Values for
Overall     Scale Values for       Two Judges           Three Judges
Rating      Single Judges          (Summed Ratings)     (Summed Ratings)

  5         5.00                   10.00                15.00
  4+        4.60 - 4.99             9.20 - 9.99         13.80 - 14.99
  4         4.00 - 4.59             8.00 - 9.19         12.00 - 13.79
  3+        3.60 - 3.99             7.20 - 7.99         10.80 - 11.99
  3         3.00 - 3.59             6.00 - 7.19          9.00 - 10.79
  2+        2.60 - 2.99             5.20 - 5.99          7.80 - 8.99
  2         2.00 - 2.59             4.00 - 5.19          6.00 - 7.79
  1+        1.60 - 1.99             3.20 - 3.99          4.80 - 5.99
  1         1.00 - 1.59             2.00 - 3.19          3.00 - 4.79
  0+        0.60 - 0.99             1.20 - 1.99          1.80 - 2.99
  0         0.00 - 0.59             0.00 - 1.19          0.00 - 1.79

            Range of               Range of             Range of
            Scale Values for       Scale Values for     Scale Values for
Overall     Four Judges            Five Judges          Six Judges
Rating      (Summed Ratings)       (Summed Ratings)     (Summed Ratings)

  5         20.00                  25.00                30.00
  4+        18.40 - 19.99          23.00 - 24.99        27.60 - 29.99
  4         16.00 - 18.39          20.00 - 22.99        24.00 - 27.59
  3+        14.40 - 15.99          18.00 - 19.99        21.60 - 23.99
  3         12.00 - 14.39          15.00 - 17.99        18.00 - 21.59
  2+        10.40 - 11.99          13.00 - 14.99        15.60 - 17.99
  2          8.00 - 10.39          10.00 - 12.99        12.00 - 15.59
  1+         6.40 - 7.99            8.00 - 9.99          9.60 - 11.99
  1          4.00 - 6.39            5.00 - 7.99          6.00 - 9.59
  0+         2.40 - 3.99            3.00 - 4.99          3.60 - 5.99
  0          0.00 - 2.39            0.00 - 2.99          0.00 - 3.59



References 

Clark, John L. Foreign Language Testing: Theory and Practice.
Philadelphia: Center for Curriculum Development, 1972.

Foreign Service Institute. "Absolute Language Proficiency Ratings."
Washington: Foreign Service Institute, 1963.

Jones, Randall L. "Testing Language Proficiency in the United States
Government." In Testing Language Proficiency, edited by Randall L.
Jones and Bernard Spolsky, pp. 1-7. Arlington, Va.: Center for
Applied Linguistics, 1975.

Rice, Frank A. "The Foreign Service Institute Tests Language Proficiency."
Linguistic Reporter 1 (1959): 2, 4.

Wilds, Claudia P. "The Oral Interview Test." In Testing Language
Proficiency, edited by Randall L. Jones and Bernard Spolsky, pp.
29-38. Arlington, Va.: Center for Applied Linguistics, 1975.



THIRD RATING OF FSI INTERVIEWS



Pardee Lowe, Jr.
Central Intelligence Agency

This study has a long and school-wide genesis. It originated with
an instructor's comment that Third Raters tend to place a candidate's
speaking proficiency lower than do the interviewers who actually conduct
the evaluation. Because this view is rather prevalent at the CIA Language
School (LS) and, further, because the LS has maintained fairly complete
Third Rater records dating back three years, this study seemed both
feasible and desirable and, indeed, has proven enlightening, given the
testing folklore to which we often unquestioningly subscribe.

Before turning to the methodology and the attendant results, the
several types of raters need to be defined. At the LS, each language
candidate's speaking proficiency is simultaneously, but independently,
evaluated by two interviewers directly after a "live" oral interview.
These interviewers will henceforth be referred to as Original Raters.
Under certain conditions the opinions of one or more Third Raters will be
called for. This might occur when there is a discrepancy between the
ratings of the Original Raters, when the test score is disputed by the
test candidate, or when the sample or the elicitation technique used
to arrive at the sample strikes either of the Original Raters, their
supervisor, or the chief of testing as unusual and worthy of closer
scrutiny. For present purposes any rater who was not a member of the
Original Rater team is regarded as a Third Rater. Thus, it is possible
to speak of the first Third Rater, second Third Rater, and so forth, so
long as these raters have listened to the same interview.

A second distinguishing characteristic of the Third Rater is that he 
or she is limited to evaluating an audio (only) tape recording of the 
interview. Furthermore, it should be emphasized that Third Rater data 
represent only the deviant cases (as defined above and which occurred 
from the beginning of 1975 through November 1977). The cases requiring 
three raters amount to less than 25 percent of all testing done in the 
languages reported. Security considerations preclude citing the actual 
percentage, which is considerably less than 25 percent. A more complete 
study would perforce include the vast majority of the evaluations in which 
there were no substantial disagreements between the Original Raters or 
any other reason to question the findings. The present study makes no 
pretense of being a thorough inter- and/or intrarater reliability and 
validity study. 



The present paper would not have been possible without the aid and
support of the LS staff and instructors. I wish to express my gratitude
to the LS instructor who raised the question about severity error in Third
Raters' scoring and to Michael Gibbons and John Quinones, who passed the
question on and have read and given helpful suggestions on the present
paper. Above all I wish to thank Robert J. Vincent, without whose help
the statistical portions of the paper could not have been carried out.



Several arguments have been put forth in support of the hypothesis
that a Third Rater tends to be a more severe judge than the Original
Raters: (1) Third Raters listen to tapes; they are not present at
the creation of the speech sample and therefore are not privy to the
"richness" of the "live" performance. (2) A Third Rater has more time
to concentrate on listening to the candidate's errors, for in the test
itself the Original Raters have their hands full orchestrating the
elicitation of the speech sample via an interactive question/answer
interview. Although they may take notes during the process, much reliance
upon memory goes into arriving at their final assessment. Thus, one
might conclude that the Third Rater's increased opportunities to
concentrate on the candidate's performance might unveil more errors, with
the consequence of a lower rating. (3) A Third Rater has the means to repeat
(play back) any portion of the interview to check for errors, further
increasing awareness of the number and types of errors. (4) A Third
Rater may be asked to write out detailed comments and examples so that
the test might be discussed more fully among the supervisor, the original
testers, and the Third Raters. Again, the type and extent of these
comments may lead one to predict a lower rating from the Third Rater.

A word or two at this point is in order on the matter of the ratings
themselves. Figure 1 is an example of the language proficiency rating
sheet on which each rater records his final assessment of the candidate's
speaking prowess. For the candidate in question, Original Rater 1 scored
the performance 2.8; Original Rater 2 was much more lenient (3.8). The
large discrepancy led to a Third Rater being pressed into service (scoring
the candidate 3.3, which, coincidentally, just happens to be the average
of the scores set by the Original Raters). Each score is indicated along
the speaking scale, as well as in the box to the left of the scale.

The official rating of each candidate is expressed as a range of
proficiency (the eleven-point FSI rating scale), as depicted in Figure 1.
In terms of the present data, Rater 1's score fell in the "2+" range,
while Rater 2's score reached the "3+" range. Original Raters are
considered to have arrived at the same proficiency evaluation if and
only if each marks the same range, regardless of the actual numerical
scale score. Since these raters' scores fell within different (indeed,
discontinuous) ranges, a third rating seemed warranted. As noted, the
third evaluation fell in the intermediate ("3") range.
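The agreement rule described above can be sketched in a few lines. The function names `fsi_range` and `raters_agree` are hypothetical, and the .60 boundary for the "plus" ranges is taken from the graphic scale described in the Quinones paper in this volume; the scores are those of the example candidate.

```python
def fsi_range(score):
    """Return the FSI range label for a numerical score such as 2.8."""
    tenths = round(score * 10)  # work in whole tenths to avoid float comparisons
    base, fraction = divmod(tenths, 10)
    return f"{base}+" if base < 5 and fraction >= 6 else str(base)


def raters_agree(score_1, score_2):
    """Original Raters agree only if both scores fall in the same FSI range."""
    return fsi_range(score_1) == fsi_range(score_2)


# Scores from the example: Original Rater 1, Original Rater 2, Third Rater.
print(fsi_range(2.8), fsi_range(3.8), fsi_range(3.3))
print(raters_agree(2.8, 3.8))
print(raters_agree(2.6, 2.9))
```

The last call shows the other side of the rule: two scores that differ numerically (2.6 and 2.9) still count as agreement because both lie in the "2+" range.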



FIGURE 1

Language Proficiency Ratings
(Oral-Aural Skills)

[Rating sheet with spaces for Examinee, Examiner, Language, Test Number,
and Date, and with SPEAKING and UNDERSTANDING scales divided into ranges
from "0" to "5," including the "plus" ranges. On the SPEAKING scale,
Rater 2 is marked in the "3+" range, Rater 3 in the "3" range, and
Rater 1 in the "2+" range.]



Hypotheses

It should be clear at this point that each candidate's speech sample
routinely receives two types of evaluations from each Original Rater: a
numerical score and its corresponding FSI scale rating (encompassing a
range of numerical scores). (For a fuller understanding of this process,
see the John Quinones paper on independent rating in this volume.)
Consequently, it is quite possible for the Original Raters' numerical
scores not to agree, but if they fall within the same scale range the
candidate will, in the last analysis, be judged similarly by both raters.
Since the test of one type of criterion measure would not be complete
without an evaluation of the other, two sets of hypotheses were
established for testing, as follows:



Group 1 (FSI Scale Ratings) 

a. Third Rater evaluations fall in an FSI range above 
those of either of the Original Raters. 

b. Third Rater evaluations fall in an FSI range below
those of either of the Original Raters. 

c. Third Rater evaluations fall in an FSI range 
intermediate to those of the Original Raters. 

d. Third Rater evaluations are equal to at least one 
original rating. 



Group 2 (FSI Numerical Ratings) 

e. The average Third Rater numerical rating is equal
to the average of the Original Raters' numerical
ratings.

f. The average Third Rater numerical rating is equal
to the official numerical rating.

g. The average of the Original Raters' numerical rating
is equal to the official numerical rating.



Experimental Sample 

LS records from 1975 through November 1977 were culled for each 
instance of a Third Rater evaluation. Sufficient numbers of such 
evaluations were found in French, Spanish, German, Russian, Chinese, 
Japanese, and Portuguese. In all, 163 examples were recorded, but for 
a variety of reasons, some of the analyses were restricted to a maximum 
of 149 cases. 



Analyses were restricted to data combined across all languages.
Analyses were conducted on both individual-language and grouped data.






Procedure and Results 

A series of chi-square analyses was conducted to test the Group 1
(FSI scale ratings) hypotheses. The chi-square test is ideally suited
for testing whether a statistically significant difference exists between
an observed number of events falling in each of several categories and
an expected number based on the hypothesis that there are no systematic
differences in the number of events in each of the categories. Table 1
summarizes the tests and attendant results associated with each of the
Group 1 hypotheses.
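The chi-square computation can be checked against the counts reported in Table 1 with a short sketch. The function name `chi_square` is an assumption for illustration; the critical values are the standard tabled values of the chi-square distribution; and the observed counts are those given in Table 1 for categories a through d (the count for category a appears in Tests 3 and 5 of that table).

```python
def chi_square(observed, expected=None):
    """Pearson chi-square statistic; defaults to equal expected counts."""
    if expected is None:
        expected = [sum(observed) / len(observed)] * len(observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Critical values of the chi-square distribution (standard tables).
CRITICAL = {(1, 0.05): 3.841, (1, 0.01): 6.635, (2, 0.05): 5.991, (3, 0.01): 11.345}

# Test 1: observed counts for categories a, b, c, d (N = 149).
counts = [20, 35, 17, 77]
stat = chi_square(counts)
df = len(counts) - 1
print(f"chi-square = {stat:.2f}, df = {df}, reject at .01: {stat > CRITICAL[(df, 0.01)]}")

# Test 2: below both Original Raters (b) versus all other outcomes (a + c + d).
stat_2 = chi_square([35, 114])
print(f"chi-square = {stat_2:.2f}, df = 1, reject at .01: {stat_2 > CRITICAL[(1, 0.01)]}")
```

With equal expected counts, the statistic depends only on how far each observed count departs from N divided by the number of categories, which is why the large count in category d dominates Test 1.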

Results of the first test indicate that the frequencies with which
the ratings fell into the several categories (expressed as hypotheses a
through d) were not equally distributed. In other words, there were
systematic or nonchance differences in the manner in which the rating
frequencies were distributed across the categories.

The second test addresses whether or not the frequency with which
Third Raters grade below the Original Raters is comparable to the
frequency with which they do not rate below. The answer, quite clearly,
is that the number of times a Third Rater is more severe than both the
Original Raters is more than offset by the number of times he is more
lenient than at least one of them. As before, such differences are
significantly nonchance.

Test 3 is concerned with the situation in which the Third Rater
scores differently than both Original Raters. In other words, those cases
wherein the Third Rater agreed with at least one of the Original Raters
have been excluded from consideration. Here again there are significant
differences in the frequencies among the categories, leading to the
question posed in Test 4, which, paraphrased, reads: "When Third Raters'
scores differ from those of at least one of the Original Raters, do Third
Raters more often than not score lower?" The answer, in terms of FSI
scale ratings, is most definitely no. Expressed another way, when Third
Raters' scores do, in fact, differ from those of at least one of the
Original Raters, there is as much chance that the Third Rater will score
higher than at least one of the Original Raters as that he will score
lower.

However, Test 5 indicates that significantly more Third Raters scored
lower than both Original Raters than scored higher than both. The same
was true in the comparison of the number of instances where Third Raters
scored lower than both Original Raters to the situation where Third
Raters' scores were in an FSI range intermediate of those of the Original
Raters (Test 6).

Lastly, results from Test 7 show that for every instance where the
Third Rater scored below both Original Raters, more than twice as often
this rater scored higher than one of them.






TABLE 1

Tests of Group 1 Hypotheses
(FSI Scale Ratings)

                                          Hypotheses*
Null Hypotheses               a       b       c       d     a+c(+d)

1. Equal numbers of cases fall in each category; N = 149.
     OBSERVED                20      35      17      77       -
     EXPECTED              37.25   37.25   37.25   37.25      -
     Chi-square = 61.55; df = 3; p < .01; reject null hypothesis.

2. Equal numbers of cases fall in each category; N = 149.
     OBSERVED                 -      35       -       -      114
     EXPECTED                 -     74.5      -       -      74.5
     Chi-square = 41.89; df = 1; p < .01; reject null hypothesis.

3. When Third Rater's score was different from those of both
   Original Raters, equal numbers of cases fall in each
   category; N = 72.
     OBSERVED                20      35      17       -       -
     EXPECTED                24      24      24       -       -
     Chi-square = 7.75; df = 2; p < .05; reject null hypothesis.

4. When Third Rater's score was different from those of both
   Original Raters, equal numbers of cases fall in each
   category; N = 72.
     OBSERVED                 -      35       -       -       37
     EXPECTED                 -      36       -       -       36
     Chi-square = 0.06; df = 1; p > .05; accept null hypothesis.

5. When Third Rater's score was different from those of both
   Original Raters, equal numbers of cases fall in each
   category; N = 55.
     OBSERVED                20      35       -       -       -
     EXPECTED               27.5    27.5      -       -       -
     Chi-square = 4.09; df = 1; p < .05; reject null hypothesis.

6. When Third Rater's score was intermediate or lower than those
   of both Original Raters, equal numbers of cases fall in each
   category; N = 52.
     OBSERVED                 -      35      17       -       -
     EXPECTED                 -      26      26       -       -
     Chi-square = 6.23; df = 1; p < .05; reject null hypothesis.

7. When Third Rater's score was lower than both or equaled at
   least one of Original Raters, equal numbers of cases fall in
   each category; N = 112.
     OBSERVED                 -      35       -      77       -
     EXPECTED                 -      56       -      56       -
     Chi-square = 15.75; df = 1; p < .01; reject null hypothesis.

*Hypotheses:
(a) Third Rater above both Original Raters.
(b) Third Rater below both Original Raters.
(c) Third Rater intermediate.
(d) Third Rater equal to at least one Original Rater.



In short, statistical analysis of FSI rating scale data does not
support the contention that Third Raters are more severe assessors of
speaking proficiency than are their Original Rater counterparts. To the
contrary, Third Raters are as lenient or more so than at least one of the
Original Raters better than 75 percent of the time, at least as far as FSI
scale ratings are concerned.

Discussion thus far has been restricted to the FSI rating scale
data. It was mentioned earlier that several hypotheses were generated
concerning the comparability of the numerical ratings arrived at by
the Original and Third Raters. To that end, an additional series of
statistical analyses was conducted.

Table 2 summarizes the results on a language-by-language as well as
an across-language basis. The sample size in these analyses totaled 163
(including the 149 reported earlier).

The earlier analyses dealt with the number of times an event
occurred. The present situation has to do with actual scores, and,
for that reason, another type of analysis is in order. To test for
differences in numerical scores between and among the various types of
raters, a statistical technique called a t-test was applied to the data.
Like the chi-square test, the t-test determines whether the differences
between groups (actually, pairs of groups) are statistically
significant rather than attributable to chance variation.
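The paper does not say whether a paired or an independent-samples t-test was used. As a minimal sketch, assuming a paired design (both ratings refer to the same interview), the statistic could be computed as below. The function name `paired_t` is hypothetical, and the four score pairs are invented for illustration only; they are not data from the study.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t, df) for a paired t-test on two equal-length score lists."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [a - b for a, b in zip(first, second)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1


# Hypothetical scores for four interviews, invented for illustration only.
third_rater = [2.5, 3.0, 2.0, 3.5]
original_avg = [2.8, 3.1, 2.4, 3.6]
t, df = paired_t(third_rater, original_avg)
print(f"t = {t:.2f}, df = {df}")
```

A negative t here would indicate that the Third Rater scores are lower on average; whether the difference is significant depends on comparing t with the critical value for the given degrees of freedom.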

Attention is directed first to the overall results at the bottom of 
Table 2. The average Third Rater numerical rating across all languages 
studied was 2.50 (a "2" on the FSI scale, since 2.6 would be required to 
reach the "2+" level). The rating arrived at by averaging the scores of 
the Original Raters was 2.62 (a "2+"). The difference between these
numerical ratings was found by the t-test to be highly significant (thus
rejecting hypothesis e).
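
The conversion from an averaged numerical rating to a scale label can be sketched as follows. This is only an illustration: the text states the .6 cutoff for the "2+" level, and applying the same cutoff at every level is an assumption (it agrees with the legible entries of Table 2). The function name `fsi_scale_label` is hypothetical.

```python
import math


def fsi_scale_label(numeric_rating, plus_cutoff=0.6):
    """Map an averaged numerical rating to an FSI-style scale label.

    Assumption: a fractional part of plus_cutoff or more earns the "plus"
    at every level, as the text states for the "2+" level.
    """
    whole = math.floor(numeric_rating)
    # Round the fractional part to avoid floating-point noise such as 0.6000000001.
    fraction = round(numeric_rating - whole, 6)
    return f"{whole}+" if fraction >= plus_cutoff else str(whole)


# Overall averages quoted in the text: official, Third Rater, Original Raters.
for value in (2.38, 2.50, 2.62):
    print(value, fsi_scale_label(value))
```

Under this rule the official (2.38) and Third Rater (2.50) averages both map to "2", while the Original Rater average (2.62) maps to "2+".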

A comparable analysis was concerned with the Third Rater/official
rating relationship. Although the difference in average numerical ratings
was found to be very significant (2.50 for the Third Raters; 2.38 for the
official rating, rejecting hypothesis f), both sets of numerical ratings
fell within the "2" rating scale.

The third overall comparison looked at the differences between
average Original Rater numerical scores and the official ratings
(expressed as numerical scores). Once again there were highly significant
differences in favor of the Original Raters. Moreover, the corresponding
FSI rating scores differed as well ("2+" vs. "2," respectively), rejecting
hypothesis g.

A look at the individual language data reveals that what was true
for the across-language data need not hold for any particular language.
Although French and German Third and Original Raters disagreed beyond
chance levels (Third Raters more severe) and similar, but not statistically
significant, differences were found for Spanish, Russian, and






TABLE 2

Tests of Group 2 Hypotheses
(FSI Numerical Ratings)

             Average                Average                            Average
             FSI          Third     Original     Third     Official    Original     Official
Languages    Ratings      Rating    Rating       Rating    Rating      Rating       Rating

French       Numerical    2.56      2.77         2.56      2.47        2.77**       2.47**
             (Scale)      "2"       "2+"         "2"       "2"         "2+"         "2"

Spanish      Numerical    2.54      2.67         2.54      2.40*       2.67         2.40**
             (Scale)      "2"       "2+"         "2"       "2"         "2+"         "2"

German       Numerical    2.98      3.16*        2.98      2.94        3.16         2.94**
             (Scale)      "2+"      "3"          "2+"      "2+"        "3"          "2+"

Russian      Numerical    2.12      2.25         2.12      2.03        2.25         2.03**
             (Scale)      "2"       "2"          "2"       "2"         "2"          "2"

Chinese      Numerical    2.66      2.61         2.66      2.47        2.61         2.47
             (Scale)      "2+"      "2+"         "2+"      "2"         "2+"         "2"

Japanese     Numerical    2.27      2.12         2.27      2.00*       2.12         2.00
             (Scale)      "2"       "2"          "2"       "2"         "2"          "2"

Portuguese   Numerical    3.87      3.64         3.87      3.46**      3.64         3.46
             (Scale)      "3+"      [illegible]  "3+"      "3"         [illegible]  "3"

TOTAL        Numerical    2.50      2.62**       2.50      2.38**      2.62         2.38**
             (Scale)      "2"       "2+"         "2"       "2"         "2+"         "2"

 * Probability of a difference this large due to chance less than .05.
** Probability of a difference this large due to chance less than .01.






Chinese, the opposite held true for Japanese and Portuguese (but the
differences could easily be attributable to chance factors).

Third Rater vs. official rating comparisons revealed that Spanish,
Japanese, and Portuguese Third Raters arrived at significantly higher
numerical ratings than turned up in the official ratings. Each of the
remaining languages followed suit, but the differences failed to reach
conventional levels of significance.

Finally, average Original Rater numerical scores exceeded the
official ratings in every case, with the differences for French, Spanish,
German, and Russian highly significant.

With few exceptions, then, the numerical rating data indicate that
there are highly significant differences among the official ratings
(2.38), the Third Rater scores (2.50), and the average of the Original
Raters (2.62). When these scores are converted to FSI scale ratings,
however, both the official and Third Rater results are found to be more
conservative ("2") than those for the Original Raters ("2+"). Therefore,
the hypothesis that Third Raters grade more severely than the Original
Raters is supported (in terms of both numerical scores and their
equivalent scale ratings). This test contradicts the findings up to
now. However, it must be remembered that arithmetic means are more
influenced by a wide discrepancy in scores. This test, therefore,
reflects variations in ratings. Since we deal at the LS in FSI levels
(each of which comprises a range of scores), this test has the least
significance for LS scores.

Third Raters tend to be more generous with their numerical scores 
than was reflected in the official ratings (although the corresponding
scale ratings fell in the "2" range in both instances). 

The import of this study for the LS and others who may opt to use an 
independent rating system with Third Raters is that, with properly trained 
personnel, severity error in Third Raters need be only a minor problem. 
Restricting our comments to the FSI rating analysis above, Third Raters
were as lenient or more so than at least one of the Original Raters better
than 73 percent of the time.



DETERMINING THE EFFECT OF UNCONTROLLED SOURCES
OF ERROR IN A DIRECT TEST OF ORAL PROFICIENCY
AND THE CAPABILITY OF THE PROCEDURE TO DETECT
IMPROVEMENT FOLLOWING CLASSROOM INSTRUCTION



Karen A. Mullen
University of Louisville



During the last few years, interest in direct testing of oral 
proficiency has grown. A number of research questions have been raised 
about the relationship between reliability and such variables as methods
of scoring, length of testing time, number of interviewers, and number of
interviews. In addition, questions have been posed about the relationship
between direct and indirect tests of oral proficiency. I wish to present
the results of a study undertaken in one ESL program to determine the
answer to yet two more questions, one concerning the effect of
uncontrolled sources of error in the procedure and the other involving the
issue of whether the procedure can detect improvement in proficiency from
one period to another. To allow for comparison between the FSI interview
and the one described here, I will first note the context in which the
study was conducted and then proceed to a description of the research
design. I will then present the results and discuss the ways in which the
oral interview may relate to indirect tests of oral proficiency.

At the time of this study, admission of a foreign student into an 
academic program at either the undergraduate or the graduate level at the 
University of Iowa was contingent upon academic eligibility and a TOEFL
score of at least 480. The only exceptions to this were Vietnamese
applicants, who were generally admitted without proof of eligibility or a
TOEFL score report. Students whose TOEFL scores were between 480 and 550
or who had no scores to report were referred to our ESL program for
further proficiency evaluation and recommendation to the ESL program if it 
seemed warranted. 

As part of this evaluation, an examinee was interviewed by two
instructors for fifteen to twenty minutes. One of the interviewers took
the major responsibility for conducting the interview and the other
listened, occasionally interjecting questions to clarify a
misunderstanding or to move the conversation along in a natural and informal way.
The intent was to make the interview as much like a real-life conversation
as possible. At the beginning, the examinee was made to feel comfortable;
talk usually centered around the weather, details of getting to the
interview, country of origin, length of stay in the United States, and
so forth. The interviewer then tried to find a broad topic on which the
examinee could speak with some authority for a period of time. Usually
examinees were asked to tell about their families, education, academic
interests, goals, opinions, impressions, and attitudes. Interviewers
were told not to modify their syntax or rate of speaking unless it
became apparent that examinees did not understand. When this occurred,
interviewers rephrased their questions and attempted to continue the 
conversations. If it was apparent that examinees were able to hold their 
own, every attempt was made to give them the opportunity to demonstrate 
their full ability to engage in communicative dialogue. 






Following the interview, the two instructors rated the examinee on 
five scales of proficiency: listening comprehension, pronunciation,
fluency, grammar, and overall proficiency. Each scale was represented by
five contiguous boxes of equal size, labeled poor, fair, good, above
average, and excellent. Interviewers were instructed to put an "X" either
inside the box or on the line between two boxes. These were later
converted to numerical values (1 = poor, 2 = between poor and fair,
3 = fair, 4 = between fair and good, ... 9 = excellent). Interviewers
consulted descriptions for the five levels of proficiency for each of the
first four scales when determining the level. Overall proficiency was 
based on a subjective composite of the other four scales. The rating form 
and the skill-level descriptions are given in Appendix A. 
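
The conversion of rating-form marks to the nine-point numerical values can be sketched as below. This is a minimal illustration of the mapping described above; the function name `mark_to_score` and the way a mark is represented (one label for a mark inside a box, two adjacent labels for a mark on the line between boxes) are assumptions made for the example.

```python
# Box labels in the order they appear on the rating form.
LABELS = ["poor", "fair", "good", "above average", "excellent"]


def mark_to_score(label, upper_label=None):
    """Convert a mark on the five-box scale to a value from 1 to 9.

    A mark inside a box gets an odd value (poor = 1, ..., excellent = 9);
    a mark on the line between two adjacent boxes gets the even value between.
    """
    lower = LABELS.index(label)
    if upper_label is None:
        return 2 * lower + 1
    if LABELS.index(upper_label) != lower + 1:
        raise ValueError("a line mark must fall between two adjacent boxes")
    return 2 * lower + 2


print(mark_to_score("fair"))          # mark inside the "fair" box
print(mark_to_score("fair", "good"))  # mark on the line between "fair" and "good"
```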

To some degree, the procedure for assigning levels in this study 
differs from the FSI procedure. FSI interviewers are asked to make a 
global judgment first and then to fill out a five-scale checklist, with
six intervals per scale. The global judgment on the FSI interview is not 
directly tied to any of the six intervals since the global judgment ranges 
from 0 to 5, with "pluses" in between. In this study, on the other hand, 
consideration of the four scales precedes overall judgment and the levels 
in each scale can be considered to be directly tied to the levels in the 
overall scale. Furthermore, unlike the case with the FSI interview, 
"vocabulary" is not one of the scales considered. 

Interviewers in this study were ESL teachers who had had formal 
training in linguistics and language teaching and had taught ESL for at 
least one year. Because of the number of students to be interviewed and
the time available for scheduling, the interviewers were randomly paired
and assigned to interviews in two two-hour blocks, with a one-hour break
between blocks, on each of three days. Examinees were randomly scheduled
and assigned to the interviewing teams. No interviewer had ever met an 
examinee before the interview. 

Following a semester of instruction, the subjects were interviewed 
again under the same format. To ensure that no instructional bias would 
be introduced in the second interview, interview teams were assigned to 
interview people who had not been students in their classes. These 
teams also interviewed new students who were referred to the program for 
evaluation and possible recommendation to ESL classes. As a result, they 
were not able to distinguish old students from new students.

The first objective of the study was to determine the best estimate 
of reliability for each of the testing periods. Reliability can be 
defined in a number of different ways. For the purpose of this study, I 
shall assume that in a situation in which a rater is given the task of 
estimating the magnitude of a specified characteristic for a given person 
in a single performance: 

(1) the magnitude of the specified characteristic is constant;
and 






(2) the estimation of the specified characteristic by the rater 
consists of the constant magnitude just cited and an error 
of measurement that is due in part to the rater and in part 
to the conditions surrounding the measurement. 

(1) is the "true score" and (2) is the observed score. For any number of 
raters under the same conditions, I further assume that: 

(3) the true score of the person rated does not vary from rater 
to rater;

(4) the observed score of the person rated does vary from rater 
to rater; and 

(5) the best estimate of that part of the score that varies from
rater to rater is the mean error of measurement. 

For any number of people to be evaluated, it is assumed that: 

(6) the true scores will vary from person to person;

(7) the observed scores will also vary from person to person; 
and 

(8) the variance of the observed scores is due in part to the
variance in the true scores and in part to the variance in
the mean error of measurement.

From (8) one may derive the equation: 

(9) variance of observed scores = variance of true scores + 
variance of mean error. 

If there were no variance in the mean error of measurement, the
measurements would be 100 percent reliable. By the same token, the larger the
variance in the mean error of measurement, the less reliable the
measurements. Thus, the reliability of x raters is a ratio (where x is the
number of raters):

(10) \[ \frac{\text{variance of true scores}}{\text{variance of true scores} + \text{variance of mean error of measurement}} \]

An analysis of variance provides an estimate of the variance of the 
mean error of measurement; in terms of the total variation, it is that 
part that is due to the variation within people. An analysis of variance 
will also provide an estimate of the variance of the observed scores, 
i.e., the denominator in (10); it is that part of the total variation 
that is due to the variation among people. These two estimates will be 
sufficient for determining the reliability of measurements, where x is
the number of raters: 






(A1) \[ \frac{\text{average variation between people} - \text{average variation within people}}{\text{average variation between people}} \]

This estimate of reliability is biased since the average variation within 
people is affected by the number of people in the sample and the number of 
raters. Therefore, an adjustment must be made to produce an unbiased
estimate: 



(A2) \[ \frac{\text{average variation between people} - m\,(\text{average variation within people})}{\text{average variation between people}} \]

where \[ m = \frac{(\text{number of people})(\text{number of raters} - 1)}{(\text{number of people})(\text{number of raters} - 1) - 2} \]

In general, the unbiased reliability (A2) will be lower than the biased
one (A1). The smaller the number of people in the sample or the smaller
the number of raters, the larger the difference between A1 and A2. For
example, were 2 raters employed, it would require a sample of 2,000
people for the difference between the two to be minimal. If the number of
raters were increased to 3, a sample of about 1,000 people would be
required for the difference to be minimal. If only 15 subjects were to be
rated, it would require 135 raters for there to be a minimal difference
between A1 and A2. Naturally, the smaller the number of people and the
smaller the number of raters, the greater the difference between A1 and
A2. Thus, for a small sample or a small number of raters or both, the
unbiased reliability (A2) is the more appropriate statistic.
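
As a rough illustration of the two estimates, the sketch below computes A1 and A2 from a people-by-raters table of scores. It assumes that "average variation" means the analysis-of-variance mean square (a sum of squares divided by its degrees of freedom); the function name `reliability_a` and the sample ratings are hypothetical and are not data from this study.

```python
def reliability_a(ratings):
    """Return (A1, A2) for a list of per-person rating lists (one score per rater)."""
    n = len(ratings)        # number of people
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    person_means = [sum(row) / k for row in ratings]

    # Between-people mean square: spread of person means around the grand mean.
    ms_between = k * sum((pm - grand) ** 2 for pm in person_means) / (n - 1)

    # Within-people mean square: spread of each score around its own person's mean.
    ss_within = sum((x - pm) ** 2
                    for row, pm in zip(ratings, person_means) for x in row)
    df_within = n * (k - 1)
    ms_within = ss_within / df_within

    # Adjustment factor m from (A2); it is only defined when df_within > 2.
    m = df_within / (df_within - 2)
    a1 = (ms_between - ms_within) / ms_between
    a2 = (ms_between - m * ms_within) / ms_between
    return a1, a2


# Hypothetical ratings: four examinees, two raters, second rater one point higher.
sample = [[1, 2], [3, 4], [5, 6], [7, 8]]
a1, a2 = reliability_a(sample)
print(round(a1, 4), round(a2, 4))
```

With 2 raters and 2,000 people, m = 2000/1998, or about 1.001, which is why A1 and A2 nearly coincide for samples of that size; with only four people, as in the toy data, m = 2 and the two estimates diverge visibly.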

The variance of the average error of measurement, as mentioned in
(2), includes the variance due to the main effect of raters as well as
that due to uncontrolled errors. An analysis of variance that partitions
the within-people variation into these two components makes it possible
to further refine the estimate of observed-score variation due to
uncontrolled errors of measurement. In this respect, we may consider that
the within-people variation is composed of two subvariations: one is due
to differences between raters and the other to errors not otherwise
accounted for. We shall call this latter the residual variation. If
we reconsider reliability in a way in which the effect of raters is not to
be considered a part of the error of measurement, we then have a new
definition patterned after that of A1 and A2.

(B1) \[ \frac{\text{average variation between people} - \text{average residual variation}}{\text{average variation between people}} \]

(B2) \[ \frac{\text{average variation between people} - m\,(\text{average residual variation})}{\text{average variation between people}} \]

where \[ m = \frac{(\text{number of people} - 1)(\text{number of raters} - 1)}{(\text{number of people} - 1)(\text{number of raters} - 1) - 2} \]
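
The residual-based estimates can be sketched in the same spirit. The example below removes the rater main effect from the within-people variation before forming B1 and B2. It again assumes that "average variation" denotes a mean square; the function name `reliability_b` and the ratings are hypothetical, not data from the study.

```python
def reliability_b(ratings):
    """Return (B1, B2) for a list of per-person rating lists (one score per rater)."""
    n = len(ratings)        # number of people
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    person_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ms_between = k * sum((pm - grand) ** 2 for pm in person_means) / (n - 1)

    # Partition the within-people sum of squares into raters + residual.
    ss_within = sum((x - pm) ** 2
                    for row, pm in zip(ratings, person_means) for x in row)
    ss_raters = n * sum((rm - grand) ** 2 for rm in rater_means)
    df_resid = (n - 1) * (k - 1)
    ms_resid = (ss_within - ss_raters) / df_resid

    # Adjustment factor m from (B2); it is only defined when df_resid > 2.
    m = df_resid / (df_resid - 2)
    b1 = (ms_between - ms_resid) / ms_between
    b2 = (ms_between - m * ms_resid) / ms_between
    return b1, b2


# Hypothetical ratings: five examinees, two raters.
sample = [[2, 3], [4, 4], [5, 7], [7, 8], [3, 3]]
b1, b2 = reliability_b(sample)
print(round(b1, 4), round(b2, 4))
```

When one rater is simply a constant amount higher than the other for every examinee, the residual variation is zero and both B1 and B2 equal 1, even though A1 and A2 would fall below 1.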






B1 is also known as Cronbach's alpha; it is a biased estimator. The
addition of m into the B2 formula makes adjustments for sample size and
the number of raters. With large samples, the two values of m in A2 and
B2 will not differ appreciably. With small samples, m in A2 will be
smaller than in B2. In addition, if most of the within-people variation is
due to differences between raters, the value subtracted is smaller. This, in
combination with smaller m, will cause B2 to be greater than A2. B2 is
the more appropriate if the effect of uncontrolled sources of error is
the primary focus. B2 is also directly comparable to the Pearson
product-moment correlation since neither depends on differences due to raters. In
addition, following the suggestion of Ebel (1951), this formula is the
more appropriate if decisions are based upon the average of the two
ratings. The model upon which reliability is based is thus:

(11) \[ X_{ij} = \pi_i + \alpha_j + e_{ij} \]

Preliminary tests have shown that this model is appropriate. We may
assume that the observed score is the sum of a true constant magnitude of
the characteristic measured (π_i), the effect of rater (α_j), and the
error of measurement (e_ij). Tukey's test for nonadditivity provides no
evidence for the postulation of an interaction effect; that is, in all
samples investigated, if one rater gives a higher rating than the other,
he or she will consistently do so across all subjects. There is no
evidence for suspecting rater A's giving higher scores to some subjects
and lower scores to others while rater B does the opposite.

Table 1 shows the reliability of the mean of two measurements on each
of the speaking proficiency scales for the nine samples of subjects
evaluated in the first testing period and the fifteen samples from the
second period. The chi-square tests in Table 2 show that, with the
exception of the overall scale in the second testing period, the
reliabilities of each testing period can be considered to be drawn from the
same population (p < .01). The mean reliability for each testing period,
determined by weighting each reliability according to the size of the
sample from which it was calculated, is shown at the bottom of Table 1.

Nine of the rater pairs were the same for both testing periods.
Paired t-tests indicate no significant (p < .01) difference in the mean
reliabilities for the nine pairs in the two testing periods on any of the
five scales (listening comprehension t = .68, pronunciation t = .58,
fluency t = .86, grammar t = .91, overall t = 1.02). The reliabilities
of the first and second testing periods for the nine pairs are not
positively correlated and so may be treated as
independent samples. When the reliabilities of the six additional rater
pairs in the second testing period are included, t-tests indicate no
significant (p < .01) difference in the mean reliabilities for the
two testing periods for all pairs (listening comprehension t = .88,
pronunciation t = 1.45, fluency t = 1.24, grammar t = 1.39, overall t =
1.61). Since the mean reliabilities are not significantly different, the
means of the mean reliabilities, determined again by weighting each
reliability according to the size of the sample (N = 115, N = 152), are as
follows: listening comprehension = .883, pronunciation = .781, fluency =
[illegible], grammar = .796, and overall = .847.
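
The weighting step described above amounts to a sample-size-weighted average. A minimal sketch follows; the function name `weighted_mean_reliability` and the numbers in the example are illustrative assumptions, not values taken from Table 1.

```python
def weighted_mean_reliability(reliabilities, sample_sizes):
    """Average reliabilities, weighting each by the size of its sample."""
    if len(reliabilities) != len(sample_sizes):
        raise ValueError("need one sample size per reliability")
    total = sum(sample_sizes)
    return sum(r * n for r, n in zip(reliabilities, sample_sizes)) / total


# Illustrative values only: two rater pairs who rated 10 and 30 examinees.
print(weighted_mean_reliability([0.80, 0.90], [10, 30]))
```

Because the second pair rated three times as many examinees, the weighted mean (.875) sits closer to that pair's reliability than the simple average (.85) would.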



TABLE 1

Reliability of the Mean of Two Measurements on Each of the Speaking
Proficiency Scales for Rater Pairs for the Two Testing Periods

p r Listening _ Pronunciation Fluency Grammar Overall " 

r^^lr _ First. Second First, Second First Second First Second First Second First Seconc 

.''^ .471 .853 .673 .893 .813 

.000 .709 .000 .883 .000 

.422 .953 .835 .913 ' .844 .968 

.926 ■ .693 .640 .675 .931 .720 

.891 .801 .858 .738 .835 .713 

.840 .868 .874 ^ .844 .909 .854 

.889 , .645 .973 .000 .925 .536 

.600 .872 .870 .780 .785 .864 

.879 .802 .864 .934 .818 .900 

.847 — .769 — ,.818 

.906 — .869 — .978 

— .323 — .766 — .611 

— -.844 — .703 — .844 

— .709 — .786 — .533 
.372 — . .564 — .242. 



Weighted Means 115 152 .851 .819 .775 .787 .851 .787 .826 .771 .885



) 

I't) 

w 
1^ 



15 
17 
10 
17 
25 
M 
7 
5 
5 



7 

11 
15 
10 
12 
15 
7 
6 

14 
15 
.7 
7 

10 
10 
6 



.923 
.833 
.430 
.898 
.850 
.323 
.695 
.980 
.897 



.000 
.736 
.926 
.890 
.419 
.851 
.781 
.630 
.945 
.890 
1.000 
.680 
.874 



.773 
.863 
.758 
.917 
.656 
.822 
1.000 
.362 
.241 



.720 



.953 
.850 
.909 
.747 
.724 
.872 
.860 
.583 
.868 
.583 
.956 
.155 
.381 
.715 
.000 






TABLE 2

Results of Chi-Square Tests on Reliabilities from the First
Testing Period (number of interview pairs = 9) and the Second
Testing Period (number of interview pairs = 15)

                         Testing Period

                      First        Second
Scale                 (N = 9)      (N = 15)

Listening              9.54        25.50
Pronunciation         13.13        22.62
Fluency                8.80        24.78
Grammar                8.17        22.75
Overall                2.77        37.74*

*Significant at p < .01.



Since estimates of population parameters are acceptably high, it
appears that errors of measurement in the observed scores do not loom
large. Thus, interest now focuses on the question of whether direct
testing of speaking proficiency under the conditions described is capable
of showing improvement in performance from one testing period to another.
One hundred seven subjects were tested in both periods. Table 3 shows
the standard deviations and mean scores on each of the scales for the two
testing periods. There is no significant (p < .01) difference between
raters on any of the scales. The reliabilities for this set of subjects
are within range. Table 4 shows that the mean of the mean scores on each
of the scales in the second testing period is significantly higher than it
is in the first period (p < .05). The difference is about one-half a
level for each scale.
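
A comparison of this kind, the same subjects scored in two periods, can be sketched as a paired t statistic. That the study's tests were computed exactly this way is an assumption; the function name `paired_t` and the scores below are hypothetical.

```python
import math
import statistics


def paired_t(first, second):
    """Return (t, degrees of freedom) for paired scores from two periods.

    Differences are taken as first minus second, so improvement in the
    second period yields a negative t.
    """
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Hypothetical scores for four examinees on one nine-point scale.
first_period = [5, 6, 4, 5]
second_period = [6, 7, 5, 7]
t, df = paired_t(first_period, second_period)
print(round(t, 2), df)
```

The one-tailed probability would then be read from a t distribution with the returned degrees of freedom.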



TABLE 3

Results of T-Tests on Mean Performance as Measured by Two Raters on Five Scales of
Speaking Proficiency for Two Testing Periods Four Months Apart (N = 107)

                          First Testing Period                          Second Testing Period

Scale          Rater  Mean         S.D.         t      p     r      Rater  Mean   S.D.         t       p      r

Listening        1    [illegible]  [illegible]                        1    6.56   1.63
                 2    6.03         [illegible]  -.77   .43   .852     2    6.61   1.66         -.52    .60    .869

Pronunciation    1    [illegible]  [illegible]                        1    6.17   1.21
                 2    5.56         1.51          .56   .57   .776     2    6.02   [illegible]  1.61    .10    .827

Fluency          1    5.50         1.53                               1    6.01   1.44
                 2    5.5[?]       1.60         -.59   .55   .840     2    6.14   [illegible]  -1.24   .21    .846

Grammar          1    5.56         1.30                               1    5.94   1.31
                 2    5.46         1.37         1.19   .23   .872     2    6.03   1.25         -.86    .39    .812

Overall          1    5.65         1.35                               1    6.08   1.25
                 2    5.53         [illegible]  1.42   .15   .867     2    6.08   1.27          .00    1.00   .847










TABLE 4

Results of T-Tests on Mean Performance as Measured by Five
Scales of Speaking Proficiency for Two Test Periods Four
Months Apart (N = 107)

                  Test                                     p
Scale             Period    S.D.    Mean      t       (one-tailed)

Listening           1       1.74    5.99    -4.30        .000
                    2       1.63    6.59

Pronunciation       1       1.40    5.59    -4.53        .000
                    2       1.25    6.09

Fluency             1       1.56    5.54    -3.96        .000
                    2       1.50    6.07

Grammar             1       1.33    5.50    -4.21        .000
                    2       1.28    5.99

Overall             1       1.40    5.99    -4.37        .000
                    2       1.26    6.08






TOEFL scores for the two testing periods were available for 18 of
the 107 subjects. Improvement as measured by direct testing is evident
for these subjects on four of the five scales (p < .05), as indicated in
Table 5. Such improvement is also evident on three of the five parts of
TOEFL, the most relevant of which is the listening comprehension subtest.
This subtest is considered to be a reasonably good predictor of oral
proficiency, and one would expect improvement as measured by the direct
test to also show up on the TOEFL subtest. However, since the latter
requires the subject to read as well as listen, one might suspect that
improvement in reading proficiency accounts for the difference in
performance on the listening subtests for the two periods. In fact, the TOEFL
reading subtest does not indicate significant improvement from the earlier
testing period to the later one. Therefore, the change in performance on
listening comprehension appears to be due to a real change in aural
proficiency. The results from the listening comprehension scale of the
interview corroborate this conclusion. Moreover, if it is true that the
listening comprehension subtest is an indirect measure of other oral 
skills, such as pronunciation or fluency, one would expect improvement in 
pronunciation and fluency in the interview. This is the case. 

Likewise, if TOEFL and the interview are two ways of measuring the 
same thing and if TOEFL shows greater control over grammar, one would 
expect the interview to reflect this. By the same token, were TOEFL to 
show no improvement in English structure, this would show up in the 
interview as well. However, it is clear that this is not the case. It 
seems that the interview is measuring some aspect of control over English
structure that TOEFL is not, and vice versa. Given the fact that the
TOEFL structure subtest gives subjects the opportunity to make grammatical
judgments after thinking about the possible choices and the interview does
not, it may be that the TOEFL structure subtest is a measure of passive
control over English grammar and that the interview is a measure of active
control. If there is a difference between the two, one would expect 
improvement to be less likely in the latter. This interpretation receives 
support from the present study and may serve to explain why the TOEFL 
structure subtest shows improvement and the grammar scale of the interview 
does not. 

To examine this claim further, let us examine the relationship 
between these two types of knowledge. For those who have studied English
as an academic subject in their home countries and have had very little
opportunity to use and apply knowledge of the language in their day-to-day
activities outside the classroom, passive control over the language will
exceed active control. If the interview is to be considered a means of
testing active control and TOEFL is a means of testing passive control,
and if passive control is greater than active control, one would expect no
high degree of correlation between TOEFL and the interview in the first
testing period. However, after a period of language instruction in the
language to be learned, and after a period of time in which the subject
is forced to conduct most of his day-to-day activities in the second
language, one would expect greater active control as well as greater
passive control. Moreover, one would expect a higher correlation between






TABLE 5

Results of T-Tests on Mean Performance as Measured by Five
Scales of Speaking Proficiency and Subtest and Composite
Scores of TOEFL for a Paired Sample of 18 Subjects

                     Test                                    p
Scale                Period    Mean      N      t       (one-tailed)    S.D.

Listening              1       6.05      18    -1.86       .04           .98
                       2       6.50                                      .98

Pronunciation          1       5.36      18    -2.19       .02           .89
                       2       5.83                                     1.05

Fluency                1       5.41      18    -2.18       .02          1.01
                       2       5.77                                      .91

Grammar                1       5.52      18    -1.23       .11           .86
                       2       5.86                                      .85

Overall                1       5.50      18    -2.30       .01           .82
                       2       5.38                                      .70

TOEFL

Listening              1      37.88      18    -7.53       .00          5.50
                       2      48.94                                     7.72

English Structure      1      38.40      15    -3.63       .00          5.48
                       2      42.46                                     5.99

Vocabulary             1      39.80      15     -.79       .2[?]        8.01
                       2      41.26                                     5.75

Reading                1      43.33      15    -1.54       .07          6.74
                       2      45.66                                     5.76

Writing                1      38.60      15    -2.03       .03          7.44
                       2      43.00                                     5.90

Composite              1     397.33      15    -6.89       .00         44.24
                       2     447.61                                    44.52






the interview and TOEFL for the second testing period. This, indeed, 
turns out to be the case, as indicated in Tables 6 and 7. The former 
shows correlations between the interview and TOEFL that are not signif- 
icantly greater than zero. This is the first testing period, before 
instruction. The latter shows the correlations between the interview and 
TOEFL after instruction.

In some cases the correlations are significantly greater than zero. 
The listening and grammar subscales of the interview and the overall scale 
correlate with the listening comprehension subtest of TOEFL at a level 
greater than zero. The greater gains in TOEFL were those on the listening 
and structure subtests. We see that in contrast to the first testing 
period, in which the group was most homogeneous on these two scales and 
at the lower end, the reverse is true in the second testing period. If 
passive control over structure has increased, one would expect a con- 
comitant increase in active control over that demonstrated in the first 
testing period. This should be related to levels of proficiency as 
demonstrated by the interview. Indeed, in the second testing period the 
listening and grammar scales of the interview are correlated with the
TOEFL listening subtest at a level greater than zero. However, only about
20-25 percent of the variance of the two tests overlaps, indicating that
the two tests are measuring independent aspects of listening comprehension
as well.
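
The overlap figure follows from squaring the correlation coefficient, which gives the proportion of variance two measures share. A small sketch, using the second-period correlations of the interview listening and grammar scales with the TOEFL listening subtest reported in Table 7 (.45 and .47); the function name `shared_variance` is hypothetical.

```python
def shared_variance(r):
    """Proportion of variance shared by two measures with correlation r."""
    return r * r


# Second-period correlations with the TOEFL listening subtest (Table 7).
for r in (0.45, 0.47):
    print(f"r = {r:.2f} -> {shared_variance(r):.0%} shared variance")
```

Both values come out at roughly 20 to 22 percent, which leaves most of the variance in each test unaccounted for by the other.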

The vocabulary subtest of TOEFL also correlates at a level greater 
than zero on four of the five scales of the interview. Vocabulary 
scores do not show a significant improvement from one testing period
to the other, but because of an improvement in pronunciation skills,
pronunciation scores correlate very highly with the vocabulary subtest
scores for the second test period. It appears that passive control
over the lexicon is not very different from one period to another but
active control is. Words are more than visually recognized; they are
now articulated more precisely as they are spoken. At the same time,
recognition of words in the flow of speech has improved, as evidenced
by the change in listening comprehension scores. Thus, the higher
correlation between listening comprehension scores in the interview and
the vocabulary scores in TOEFL is an indication of greater active control
over the lexicon. 

No improvement on the grammar scale of the interview is evidenced, 
nor is improvement on the vocabulary subtest of the TOEFL. Yet a
correlation greater than zero exists between these two scales in the second
testing period but not in the first. I have no explanation for this
fact. Neither can I offer an explanation for the nonzero correlation
between the pronunciation scale of the interview and the writing ability
subtest of TOEFL. 

In general, this study suggests that the correlation between TOEFL 
subscores and the interview scores will be nonexistent when there is 
little active control of English. As active control of the language
improves, the correlation between TOEFL subscores and the interview scores






TABLE 6

Correlation of TOEFL with Five Scales of Speaking Proficiency
for First Testing Period (N = 18)

                                          TOEFL

                    LC        ES        Voc       Rdg       WA        Composite
Interview Scale     (N=18)    (N=15)    (N=15)    (N=15)    (N=15)    (N=18)

Listening            .21       .37       .02       .00       .23       .24
Pronunciation        .27       .04      -.07       .10       .22       .14
Fluency              .25       .23       .12      -.10       .22       .21
Grammar              .09       .00       .07      -.03       .02       .06
Overall              .26       .12      -.06      -.04       .11       .10






TABLE 7

Correlation of TOEFL with Five Scales of Speaking Proficiency
for Second Testing Period (N = 18)

                                          TOEFL
Interview Scale     LC        ES        Voc       Rdg       WA        Composite
                    (N=18)    (N=15)    (N=15)    (N=15)    (N=15)    (N=18)

Listening            .45*     -.06       .46*     -.08      -.27       .28
Pronunciation        .19       .29       .70**    -.24       .51*      .26
Fluency              .36       .24       .42       .05       .14       .39
Grammar              .47*      .40       .46*      .08       .32       .48*
Overall              .54**     .09       .44*      .13       .14       .43*

 * p < .03.
** p < .01.






becomes stronger. The mean TOEFL score for these subjects is below the
level one would judge necessary for full participation in an English-
speaking class. Though the data are not available here to verify the
prediction, one would expect that as speaking proficiency as measured by
the interview continued to improve, a nonzero correlation between the
grammar scale of the interview and the structure subtest of the TOEFL
would begin to surface. This bears further investigation.

The major conclusions to be drawn from this study are that direct
testing of speaking proficiency under the conditions described is a fairly
reliable procedure and that the interview cannot be expected to correlate
with subtests of the TOEFL when proficiency is low and passive control
exceeds active control. As the difference between active and passive
control diminishes, the correlation between TOEFL and a direct oral
proficiency test can be expected to be greater than zero. The claim is
that where performance on direct tests of oral proficiency is at a high
level, TOEFL will tell us that, and where performance on a direct test of
oral proficiency is low, there is no way to tell if it is due to a general
lack of knowledge about the language or lack of skill in speaking and
listening.






Appendix A
Interview Evaluation

Name ______________________________    Date ______________

Evaluator _________________________

                                                   Above
                           Poor    Fair    Good    Average    Excellent

Comprehension

Pronunciation

Fluency

Grammar

Overall Oral Proficiency






Guidelines for Evaluation of Interviews

Comprehension

Excellent:  Appears to understand everything without difficulty.
Very Good:  Understands at nearly normal speed; occasional repetition
            necessary.
Good:       Understands at slower-than-normal speed; frequent
            repetition necessary.
Fair:       Great difficulty following questions and answers.
Poor:       Cannot be said to understand even simple conversation.

Pronunciation

Excellent:  Has few traces of foreign accent.
Very Good:  Always intelligible, though definite accent present.
Good:       Concentrated listening is necessary; errors cause
            occasional misunderstanding.
Fair:       Very hard to understand; repetition frequently necessary.
Poor:       Speech virtually unintelligible.

Fluency

Excellent:  Speech as fluent and effortless as that of a native.
Very Good:  Fluency slightly affected by language problems.
Good:       Fluency rather strongly affected by language problems.
Fair:       Usually hesitant; forced into silence by language problems.
Poor:       Halting and fragmentary speech; conversation impossible.

Grammar

Excellent:  Few, if any, noticeable errors of grammar or word order.
Very Good:  Occasional grammatical and/or word-order errors.
Good:       Frequent grammar and word-order errors that obscure
            meaning.
Fair:       Comprehension difficult; frequent rephrasing; uses basic
            patterns.
Poor:       Severe errors in grammar and word order.






References

Clark, John L. D. "Theoretical and Technical Considerations in Oral
Proficiency Testing." In Testing Language Proficiency, edited by
Randall L. Jones and Bernard Spolsky, pp. 10-28. Arlington, Va.:
Center for Applied Linguistics, 1975.

Ebel, Robert L. "Estimation of the Reliability of Ratings."
Psychometrika 16 (1951): 407-24.

Hinofotis, Frances B. "Cloze Testing as a Substitute for Oral Interviews."
Paper presented at the Preconference Workshop on Cloze Testing, TESOL
Conference, Miami, Fla., 1977.

Hoyt, Cyril J. "Test Reliability Estimated by Analysis of Variance."
Psychometrika 6 (1941): 153-60.

Mullen, Karen A. "Rater Reliability and Oral Proficiency Evaluations."
In Occasional Papers on Linguistics: Proceedings of the First
International Conference on Frontiers in Language Proficiency and
Dominance Testing, Carbondale, Illinois, 1977, edited by James
Redden, pp. 133-42. Carbondale: Department of Linguistics, Southern
Illinois University, 1977.

Wilds, Claudia P. "The Oral Interview Test." In Testing Language
Proficiency, edited by Randall L. Jones and Bernard Spolsky, pp.
29-44. Arlington, Va.: Center for Applied Linguistics, 1975.

Winer, B. J. Statistical Principles in Experimental Design. 2d ed.
New York: McGraw-Hill, 1971.



RELIABILITY AND VALIDITY OF LANGUAGE ASPECTS
CONTRIBUTING TO ORAL PROFICIENCY OF
PROSPECTIVE TEACHERS OF GERMAN



Ray T. Clifford
Central Intelligence Agency




Introduction 

It has long been accepted as axiomatic that foreign language teachers
must be proficient in the languages they teach. Axelrod (1966, p. 7)
defines the "excellent foreign language teacher" as one who, along with
other skills, ". . . speaks the language intelligibly and with adequate
command of vocabulary and syntax." The MLA statement of "Qualifications
for Secondary School Teachers of Modern Foreign Languages" (1955, pp.
46-47), hereafter referred to as the MLA Teacher Qualifications Statement,
was reaffirmed in 1966 (Paquette, p. 373). It describes three levels of
oral proficiency and includes a description of the situations where these
skills are to be demonstrated:

Minimal--The ability to talk on prepared topics (e.g.,
for classroom situations) without obvious faltering, and to use
the common expressions needed for getting around in the foreign
country, speaking with a pronunciation readily understandable
to a native.

Good--The ability to talk with a native without making
glaring mistakes, and with a command of vocabulary and syntax
sufficient to express one's thoughts in sustained conversation.
This implies speech at normal speed with good pronunciation
and intonation.

Superior--The ability to approximate native speech in
vocabulary, intonation, and pronunciation (e.g., the ability
to exchange ideas and to be at ease in social situations).

Test--For the present, this ability has to be tested by
interview or by a recorded set of questions with a blank disc
or tape for recording answers.

It is interesting to note that this statement, published long
before the debate over linguistic and communicative competence developed,
recognized a combination of both linguistic and communicative skills.
Much of the discussion of "communicative competence" is directed toward
students and does not include the linguistic skills that would be expected
of a teacher who provides a model of the target language for his students.
Likewise, it can be assumed that teachers must have a communicative
competence beyond simple linguistic competence if they are to teach others
to communicate effectively. Therefore, the term "language proficiency"
will be used in this study in its broadest meaning, encompassing both
linguistic and communicative skills.






At this point, only two generally accepted methods of testing oral
proficiency in foreign languages have been developed: the speaking
portion of the MLA Cooperative Foreign Language Proficiency Test and the
FSI interview procedure. Of these two procedures, only the MLA test has
been used in assessing the language skills of pre- and inservice teachers.
Clark (1975) contends that an interview could be used to test teachers
and, according to him, it would be a more direct, and therefore a more
valid, measure of language proficiency than the generally used MLA tests.
The authors of the MLA Teacher Qualifications Statement quoted above
also considered an interview as a possible mode of oral proficiency
assessment. Although oral interviews are used by several government
agencies, including the Foreign Service Institute, the CIA, the Peace
Corps, and the Civil Service Commission (Wilds, 1975; Lowe, 1976),
these techniques have not been widely used in or specially adapted to
testing the language proficiency of teachers.

To be useful, a language proficiency test must be both valid and
reliable. The development of a proficiency interview for teachers would
provide two independently constructed tests of oral proficiency, which
would allow inferences about the concurrent and construct validity of
those measures and about the relative reliability of an indirect measure
of oral proficiency as compared to a direct interview situation.



Research Problem 

This study developed an oral interview procedure for testing
prospective teachers of German by adapting the established FSI interview
procedures used by government agencies to more closely parallel the MLA
proficiency definitions. It then compared this Teacher Oral Proficiency
assessment procedure with the only existing standardized test of foreign
language competence for teachers that includes the testing of speaking
skills: the MLA Cooperative Foreign Language Proficiency Test.

The study also examined the concept of language aspects thought
to contribute to oral proficiency. Both the FSI and MLA testing
procedures identify the same four factors as contributing to oral language
proficiency: structure or grammar, vocabulary, pronunciation, and
fluency. These four aspects of oral language are also included in the
language testing models proposed by Lado (1961), Cooper (1968), Carroll
(1968), Harris (1969), and Valette (1971). However, both the MLA and
FSI scoring procedures yield only overall scores, thus masking the
contribution of the individual scores used in arriving at a total score.

A total test score implies a homogeneity of subcategories within
the test. If, on the other hand, the scoring subdivisions used are
independently valid, each should receive a separate score. One of the
conclusions reached by the Minnesota Council of Teachers of Foreign
Languages Working Committee on Teacher Certification in 1976 was that
teachers should be at least minimally proficient in each of these areas
and not just very good in any one of the language aspects being
considered. Thus, if structure, vocabulary, pronunciation, and fluency
do contribute independently to general oral language proficiency, scores
should be computed separately for each factor--both for providing
descriptive levels of proficiency with diagnostic value and for setting
minimum levels for the certification of teachers.

No empirical evidence has been produced that points toward the
validity of these contributing factors to oral language proficiency, but a
statistical procedure suggested by Campbell and Fiske (1967) seems ideally
suited to providing such evidence. Referred to as "convergent and
discriminant validation," this procedure requires not only that indicators of
a hypothesized factor converge (i.e., show high positive correlation with
each other), but that they also be distinguishable from each other. In
statistical terms, this means that the indicators of each hypothesized
factor correlate more highly with other indicators of the same factor than
with indicators of other factors.

In summary, the main questions investigated in this study may be
briefly stated as follows:

1. Is it possible to structure a valid and reliable oral interview
and rating procedure for directly assessing the oral language proficiency
of prospective teachers of German?

2. What is the correlation between oral proficiency scores obtained
from this "direct" assessment procedure and scores from the speaking
portion of the MLA Cooperative Foreign Language Proficiency Test?

3. What are the interrater, intrarater, and test-retest
reliabilities for the speaking portion of the German MLA Cooperative Foreign
Language Proficiency Test and for the oral interview procedure in the same
situation?

4. Do measures of the same aspects of oral language, arrived at by
these different testing procedures, correlate more highly with each other
than they do with other language aspects measured by either procedure?



Procedures and Instrumentation 

The target population for this study was prospective teachers of
German enrolled at the University of Minnesota. Because of the limited
number of students applying for admission to the College of Education
during any one school year, the sample size was increased by including
in the investigation all students who, in terms of language courses
completed, were eligible to apply for admission during the 1975-76 school
year, whether they actually did apply or not. In all, fifty students were
contacted and forty-seven participated in the study.






The proficiency test used in this study was the Speaking Test,
Form HC, of the MLA Cooperative Foreign Language Proficiency Tests:
German, formerly called MLA Foreign Language Proficiency Test for Teachers
and Advanced Students: German (Buros, 1972). In this test of oral
proficiency in German, students' responses to prerecorded and visual
stimuli are recorded on audio tape for later scoring. The test lasts
fifteen minutes and is divided into three parts. In Part A the examinee
hears twenty recorded statements that he is to repeat. He is then scored
on the correctness of his pronunciation on two selected phonetic elements
in each of the last fifteen statements presented.

Part B contains a printed selection that the examinee reads first to
himself and then aloud. His pronunciation is again rated, on twenty
selected phonetic features of the language, and his reading fluency is
also rated, according to a five-point scale ranging from failure to convey
the meaning of the passage to performance like a native who reads well.

In Part C the examinee is asked to describe orally a picture or a
series of pictures. He is given three opportunities to respond ranging in
duration from forty-five to ninety seconds per picture or series of
pictures. The examinee's performance is rated separately for each of
the three picture situations in each of the areas of vocabulary,
pronunciation, structure, and fluency. The rating scales are specific to
each area, but all are rated according to a five-point scale ranging from
inadequate to native performance. The resulting twelve ratings are
totaled to arrive at the examinee's score on Part C.
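
The Part C arithmetic described above can be sketched as follows. This is only an illustration: the ratings are invented, and numbering the five-point scale 1 to 5 is an assumption, since the text says only that the scale runs from inadequate to native performance.

```python
# Hypothetical Part C ratings for one examinee: three picture tasks,
# each rated on four language aspects. The 1-5 numbering of the
# five-point scale is an assumption, not taken from the test manual.
ASPECTS = ("vocabulary", "pronunciation", "structure", "fluency")

part_c_ratings = [
    {"vocabulary": 3, "pronunciation": 4, "structure": 3, "fluency": 3},
    {"vocabulary": 2, "pronunciation": 3, "structure": 3, "fluency": 2},
    {"vocabulary": 3, "pronunciation": 4, "structure": 2, "fluency": 3},
]


def aspect_subtotals(ratings):
    """Sum each aspect's ratings over the three picture tasks."""
    return {aspect: sum(task[aspect] for task in ratings) for aspect in ASPECTS}


def part_c_total(ratings):
    """Total of all twelve ratings (3 tasks x 4 aspects)."""
    return sum(aspect_subtotals(ratings).values())


subtotals = aspect_subtotals(part_c_ratings)
total = part_c_total(part_c_ratings)
print(subtotals)
print(total)
```

Under this numbering each aspect subtotal can range from 3 to 15 and the Part C total from 12 to 60.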

The interview and rating procedure specifically designed to test
prospective teachers of German was named the Teacher Oral Proficiency
(TOP) interview. It was developed by combining the various proficiency
rating scales available into one general rating scheme that could be
used in an interview situation to test the oral language proficiency of
teachers. For this purpose a separate six-by-six matrix was developed for
each of the language aspects of grammar, vocabulary, pronunciation, and
fluency. One dimension of each matrix was divided into six proficiency
levels, designated 0 to 5, and the other dimension was divided into
categories according to the six available rating scales: the MLA Teacher
Qualifications Statement, the rating scale from the MLA speaking
proficiency test, the general FSI proficiency descriptions, the FSI grid of
"Factors in Speaking Proficiency," the FSI supplementary proficiency
descriptions, and the CIA supplementary rating criteria.

Not all six rating scales described each skill area of grammar,
vocabulary, pronunciation, and fluency at each proficiency level, but each
level was described by at least one rating scale. The matrices for
grammar, vocabulary, pronunciation, and fluency were then presented to
a "Second Languages and Cultures Education" seminar at the University
of Minnesota, where graduate students and faculty members eliminated
redundant proficiency descriptions in the rating scales. This left a
matrix of the unique contributions provided by each rating scale in
describing each aspect of oral proficiency at each level of proficiency.
These four matrices were then collapsed to form one rating grid with
separate rating scales for each language aspect.



The combined rating grid was used both as a framework for structuring
TOP interviews and as a rating scale for evaluating performance in those
interviews. A TOP interview lasts fifteen to thirty minutes and is
conducted in much the same way as an FSI interview. It may be conducted
by one or two interviewers, who begin the interview with simple questions
about general topics and then broaden the discussion as far as the
language skills of the interviewee permit. When it is evident that the
interviewee has been pushed beyond his highest level of performance,
the discussion is returned to more general topics before the interview
is ended, so the interviewee will not perceive the experience as negative
or frustrating. Ratings are assigned separately for the interviewee's
performance in the areas of grammar, vocabulary, pronunciation, and
fluency.

The MLA speaking test and TOP interviews were administered twice
each to the forty-seven students participating in the study. All tests
and interviews were recorded on cassette tapes for later scoring by the
author and three other raters, all native speakers of German, trained by
him. Tapes from the first administration of the MLA speaking test were
scored first, then the tapes from the second MLA test administration.
This was followed by a rescoring of the tapes from the first MLA test
administration. The same procedure was followed in rating the taped TOP
interviews, so each rater supplied three interview ratings and three MLA
speaking test scores for each student.

These scores and ratings were then correlated to determine the
reliability and validity of the MLA and TOP measures of oral language
proficiency in German. Different computational procedures were used
depending on the question to be investigated. Pearson product-moment
correlations were calculated to estimate validity, while intraclass
correlations were used to estimate the respective reliability of both
testing procedures. Convergent and discriminant validation criteria
as established by Campbell and Fiske (1967) were applied as a test of
the construct validity of the language aspects: grammar, vocabulary,
pronunciation, and fluency.

Several limitations are evident in this study. A major limitation
results from the relaxed criterion used in selecting the sample of
students to be tested. A sufficient number of students was tested to
allow meaningful inferences about the theoretical relationship under
study; however, the tested sample is one step removed from a truly
representative sample of prospective teachers of German. Another
limitation is that all oral interviews were conducted by the same interviewer,
making it impossible to measure or infer how much variance in students'
scores might be caused by the interaction of interviewer and interviewee
characteristics.

A third limitation is that, in an examination of concurrent validity,
no one measure of proficiency can be assumed as the standard against which
the other may be judged. Thus, a low correlation "casts doubt on both
measures, presumably equally" (Cronbach, 1971, p. 466).






Results of the Study 

From a subjective viewpoint, this attempt at developing and using an 
interview procedure to test prospective teachers of German was a success. 
The modified rating scale served well as an underlying structure for 
conducting the interviews, and the raters experienced little difficulty in 
rating interviewees' performance according to that scale. Empirically,
the results were also favorable. 

A. Concurrent validity 

Concurrent validity of the MLA test and TOP interviews was estimated 
by computing Pearson product-moment correlations. Total scores from the 
interviews correlated .834 with total MLA speaking test scores and .864
with global ratings assigned in Part C of the MLA speaking test. 

B. Reliability 

All reliability coefficients were computed using intraclass
correlational formulas, which, unlike product-moment correlations, treat
differences among the means of the correlated scores as error variance.
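
The difference between the two kinds of coefficient can be illustrated with a minimal sketch. The ratings below are invented, and the one-way analysis-of-variance form of the intraclass correlation, (MSB - MSW) / (MSB + (k - 1) MSW), is an assumption chosen for illustration; the study does not state which intraclass formula was used.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def icc_oneway(rows):
    """One-way intraclass correlation for single ratings.

    rows: one list per examinee, each holding k ratings.
    Differences between rater means end up in the within-examinee
    mean square, i.e. they are counted as error.
    """
    n, k = len(rows), len(rows[0])
    grand = mean(v for row in rows for v in row)
    row_means = [mean(row) for row in rows]
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(rows, row_means) for v in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical data: rater B ranks examinees exactly as rater A does
# but is consistently two points more lenient.
rater_a = [1, 2, 3, 4, 5]
rater_b = [score + 2 for score in rater_a]

r_pearson = pearson(rater_a, rater_b)
r_intraclass = icc_oneway([list(pair) for pair in zip(rater_a, rater_b)])
print(round(r_pearson, 3), round(r_intraclass, 3))
```

In this invented case the product-moment correlation is perfect, while the intraclass coefficient is much lower because the constant two-point difference between the raters is treated as error.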

1. Interrater reliability 

For both testing procedures, ratings of individual language aspects 
were less reliable than the sums of those ratings. The intraclass,
interrater reliability of total scores on the MLA test was .818, while 
for Part C it was found to be .829. The intraclass, interrater reli- 
ability of sums of ratings from TOP interviews was .827. The interrater 
reliability of the language aspect ratings on both testing procedures is 
summarized in Table 1. 

2. Intrarater reliability 

The mean intraclass, intrarater reliability coefficients for total
scores followed the same pattern found with interrater reliability.
The mean intrarater reliability of Part C of the MLA speaking test was
found to be .911, which is slightly larger than the mean intrarater
reliability of .897 found for total MLA speaking test scores. The mean
intrarater reliability of sums of ratings on the TOP interview was .930. 
The mean intraclass, intrarater reliability coefficients for language 
aspect ratings from both testing procedures are summarized in Table 2. 

3. Test-retest reliability 

The intraclass, test-retest reliability of total MLA speaking
test scores and those for Part C of the MLA test were both .940, while the
test-retest reliability of sums of ratings from TOP interviews was found
to be .893. As Table 3 shows, the test-retest reliabilities of individual
language aspects were lower when rated from the interviews than when rated
from the MLA speaking test.








TABLE 1

Interrater Reliability of
Language Aspect Ratings

Language            Part C, MLA        TOP
Aspect              Speaking Test      Interview

Grammar                 .709             .719
Vocabulary              .770             .699
Pronunciation           .676             .690
Fluency                 .801             .717


TABLE 2

Mean Intrarater Reliability of
Language Aspect Ratings

Language            Part C, MLA        TOP
Aspect              Speaking Test      Interview

Grammar                 .771             .903
Vocabulary              .853             .867
Pronunciation           .826             .836
Fluency                 .857             .780

C. Construct validity of contributing language aspects 

The mean scores of the language aspect ratings assigned students on 
the first administration of both the MLA test and TOP interview are given 
in Table 4. 

It is interesting that the same relative ordering of mean scores on
grammar, vocabulary, pronunciation, and fluency was found on both tests.
Students were rated highest on pronunciation, followed in descending order
by fluency, grammar, and vocabulary.






TABLE 3

Test-Retest Reliability of
Language Aspect Ratings

Language            MLA                TOP
Aspect              Speaking Test      Interview

Grammar                 .920             .859
Vocabulary              .885             .791
Pronunciation           .923             .881
Fluency                 .908             .803






TABLE 4

Variables Examined for Construct Validity
of Contributing Factors
(N = 47 for all variables)

          Language                       Standard
Test      Aspect             Mean        Deviation

MLA       Grammar            7.82          1.78
MLA       Vocabulary         7.40          2.11
MLA       Pronunciation      8.76          2.04
MLA       Fluency            7.88          2.13
TOP       Grammar            2.39          0.69
TOP       Vocabulary         2.25          0.62
TOP       Pronunciation      2.64          0.63
TOP       Fluency            2.53          0.71






A correlation matrix of the variables in Table 4 is found in Table
5. This matrix of product-moment correlations was used to examine the
ratings of grammar, vocabulary, pronunciation, and fluency for construct
validity, according to the criteria for convergent and discriminant
validation. The three essential criteria are:

1. All correlation coefficients in the validity diagonal of the
multitrait, multimethod triangle should be statistically significant and
sufficiently large to indicate convergent validity.

2. Each trait correlation coefficient in the validity diagonal
should exceed in magnitude the correlations of that trait with other
traits measured by a different method.

3. Each trait correlation coefficient in the validity diagonal
should exceed in magnitude the correlations of that trait with other
traits measured by the same method.

The validity coefficients in Table 5 have been underlined. The
conditions of criterion number 1 above were met by those correlations
found on the validity diagonal of the matrix. The conditions of criterion
number 2 were met for the language aspects of pronunciation and fluency,
but, because of a high correlation between TOP grammar ratings and MLA
vocabulary ratings, they were not met for the language aspects of grammar
and vocabulary. The conditions specified by criterion 3 were not
consistently met by any of the validity correlations. The multitrait,
multimethod correlation matrix in Table 5 gave some indication of convergent
and discriminant validation, but because of apparent method variance
introduced by the particular testing procedure used, none of the language
aspects met the conditions of criterion number 3. Therefore, validation
of the language aspects hypothesized as contributing to oral language
proficiency was not achieved using this multimethod matrix.
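
Criteria 2 and 3 can be checked mechanically against the coefficients of Table 5. The sketch below is an illustration rather than the procedure actually used in the study; the values are transcribed from Table 5, and the function and variable names are hypothetical.

```python
TRAITS = ["Grammar", "Vocabulary", "Pronunciation", "Fluency"]

# Lower triangle of Table 5. Variable order: MLA grammar, vocabulary,
# pronunciation, fluency, then TOP grammar, vocabulary, pronunciation,
# fluency. Row i holds the correlations of variable i with variables 0..i-1.
TABLE_5 = [
    [],
    [.876],
    [.882, .775],
    [.845, .946, .731],
    [.810, .827, .752, .783],
    [.744, .816, .683, .796, .876],
    [.741, .670, .788, .643, .838, .740],
    [.687, .802, .657, .819, .864, .825, .731],
]


def r(i, j):
    """Symmetric lookup of the correlation between variables i and j."""
    return TABLE_5[max(i, j)][min(i, j)]


def check_criteria(n_traits=4):
    """Apply criteria 2 and 3 to each trait's validity coefficient."""
    results = {}
    for t, name in enumerate(TRAITS):
        mla, top = t, t + n_traits
        validity = r(mla, top)
        others = [u for u in range(n_traits) if u != t]
        # Criterion 2: other traits measured by the other method.
        different_method = [r(mla, u + n_traits) for u in others]
        different_method += [r(top, u) for u in others]
        # Criterion 3: other traits measured by the same method.
        same_method = [r(mla, u) for u in others]
        same_method += [r(top, u + n_traits) for u in others]
        results[name] = {
            "validity": validity,
            "criterion_2": all(validity > c for c in different_method),
            "criterion_3": all(validity > c for c in same_method),
        }
    return results


results = check_criteria()
for name, outcome in results.items():
    print(name, outcome)
```

With the Table 5 values, criterion 2 holds only for pronunciation and fluency and criterion 3 holds for none of the four aspects, which agrees with the account given above.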

A multitrait, multirating matrix of the correlations between average
first and second ratings of the same test administration showed different
results. The resulting matrix for the TOP interview is shown in Table 6,
and the matrix for the MLA test is in Table 7.

Correlating the mean scores assigned students on the hypothesized
language aspects on the first and second ratings of the same test
administration for each procedure in effect controlled for error variance
in the students' scores resulting from method variance, interrater
variance, and trait instability. Under these ideal conditions, with high
intrarater reliability of mean scores on each of the language aspects,
all the criteria were met for convergent and discriminant validation of
grammar, vocabulary, pronunciation, and fluency. Table 6 reveals no
exceptions to the ideal requirements of convergent and discriminant
validation of the four language aspects using mean scores on the TOP
interview. Similarly, the correlated mean scores from Part C of the MLA
speaking test presented in Table 7 show only one minor flaw: the
correlation of the second rating of vocabulary with the second rating of
grammar exceeds the correlation between first and second ratings of
grammar by .001.
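
The single exception in the MLA matrix can be located with a short search. This is a sketch only: the coefficients are transcribed from Table 7, the names are hypothetical, and the rule applied (each validity coefficient must exceed every correlation involving either of its two ratings and a different language aspect) is a compact restatement of criteria 2 and 3.

```python
NAMES = ["1st Gr", "1st Vo", "1st Pr", "1st Fl",
         "2nd Gr", "2nd Vo", "2nd Pr", "2nd Fl"]

# Lower triangle of Table 7 (MLA speaking test, first and second ratings).
TABLE_7 = [
    [],
    [.876],
    [.882, .775],
    [.845, .946, .731],
    [.937, .901, .837, .890],
    [.856, .953, .769, .915, .938],
    [.853, .758, .942, .743, .869, .802],
    [.795, .914, .707, .963, .886, .926, .739],
]


def corr(i, j):
    """Symmetric lookup of the correlation between variables i and j."""
    return TABLE_7[max(i, j)][min(i, j)]


def find_violations(n_traits=4):
    """List correlations that equal or exceed a trait's validity coefficient."""
    found = []
    for t in range(n_traits):
        first, second = t, t + n_traits
        validity = corr(first, second)
        for rating in (first, second):
            for other in range(2 * n_traits):
                if other % n_traits == t:
                    continue  # same language aspect, not a competitor
                c = corr(rating, other)
                if c >= validity:
                    found.append(
                        (NAMES[rating], NAMES[other], round(c - validity, 3))
                    )
    return found


violations = find_violations()
print(violations)
```

Run on the Table 7 values, the search reports one pair, the second ratings of grammar and vocabulary, with a margin of .001.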




TABLE 5

Multitrait, Multimethod Convergent and Discriminant Validation Matrix
(N = 47 for all variables)

      Language         MLA     MLA     MLA     MLA     TOP     TOP     TOP     TOP
Test  Aspect           Gr.     Vo.     Pr.     Fl.     Gr.     Vo.     Pr.     Fl.

MLA   Grammar
MLA   Vocabulary       .876
MLA   Pronunciation    .882    .775
MLA   Fluency          .845    .946    .731
TOP   Grammar         _.810_   .827    .752    .783
TOP   Vocabulary       .744   _.816_   .683    .796    .876
TOP   Pronunciation    .741    .670   _.788_   .643    .838    .740
TOP   Fluency          .687    .802    .657   _.819_   .864    .825    .731

Correlations in the validity diagonal are underlined.

All correlations in this matrix are significant at the p < .001 level.




TABLE 6

TOP Interview Multitrait, Multirating Convergent
and Discriminant Validation Matrix

Test    Language         1st     1st     1st     1st     2nd     2nd     2nd     2nd
Rating  Aspect           Gr.     Vo.     Pr.     Fl.     Gr.     Vo.     Pr.     Fl.

First   Grammar
First   Vocabulary       .876
First   Pronunciation    .838    .740
First   Fluency          .864    .825    .731
Second  Grammar      [illegible] .832    .824    .829
Second  Vocabulary       .883   _.943_   .799    .855    .891
Second  Pronunciation    .829    .750   _.909_   .722    .810    .805
Second  Fluency          .814    .716    .694   _.908_   .813    .791    .722

Correlations in the validity diagonal are underlined.

All correlations in this matrix are significant at the p < .001 level.



TABLE 7

MLA Speaking Test Multitrait, Multirating Convergent
and Discriminant Validation Matrix

Test    Language         1st     1st     1st     1st     2nd     2nd     2nd     2nd
Rating  Aspect           Gr.     Vo.     Pr.     Fl.     Gr.     Vo.     Pr.     Fl.

First   Grammar
First   Vocabulary       .876
First   Pronunciation    .882    .775
First   Fluency          .845    .946    .731
Second  Grammar         _.937_   .901    .837    .890
Second  Vocabulary       .856   _.953_   .769    .915    .938
Second  Pronunciation    .853    .758   _.942_   .743    .869    .802
Second  Fluency          .795    .914    .707   _.963_   .886    .926    .739

Correlations in the validity diagonal are underlined.

All correlations in this matrix are significant at the p < .001 level.






Conclusions 

As shown in Table 8, ratings from TOP interviews were generally as
reliable as scores on the MLA speaking test, indicating that an oral
interview procedure can be developed that matches the reliability of the
more structured MLA speaking test.

TABLE 8

Summary of Intraclass Reliability Coefficients
for MLA and TOP Assessment Procedures

                            Interrater     Intrarater     Test-retest
Test Score                  Reliability    Reliability    Reliability

MLA Speaking Test
  Total Score                  .818           .897           .940

MLA Part C Score               .829           .911           .940

Sums of Ratings from
  TOP Interviews               .827           .930           .893



It is also interesting that Part C scores on the MLA test, which
calls for free responses from examinees, were found to be
as reliable as total MLA scores. Part C scores also correlated more
highly with ratings from TOP interviews than did total MLA speaking
scores. The product-moment correlation between Part C scores and sums of
language aspect ratings from TOP interviews was .864, which approaches the
test-retest reliability of the TOP interviews. Thus, Part C of the MLA
test and the TOP interview seem to be generally measuring the same skill.

Interrater reliability was about equal for the MLA test and the TOP
interview. Intrarater reliability was higher for the TOP interview than
for the MLA speaking test, but for test-retest reliability the situation
was reversed. This may have been the result of two factors. First,
intrarater reliability may have been improved by the more detailed rating
criteria used for rating the TOP interviews. Second, whereas the content
of the MLA speaking test was exactly the same from one test administration
to the next, TOP interviews were not identical in content. Adequacy of
language content sampled may be a problem with both types of tests. The
language sample provided by the MLA test is quite limited in scope,
while the content of the TOP interview is dependent on the skill of the
interviewer.






Correlations of ratings assigned the language aspects of grammar, 
vocabulary, pronunciation, and fluency using different testing and rating 
procedures ranged from .788 to .819. However, high correlations were 
found between different language aspects rated by the same method, which 
precluded convergent and discriminant validation of contributing language 
aspects across testing methods. This may indicate a halo effect among 
ratings assigned at the same time from the same speech sample, as well 
as variance resulting from different testing procedures and trait insta- 
bility. Evidence of construct validity for the language aspects of 
grammar, vocabulary, pronunciation, and fluency was found by applying 
convergent and discriminant criteria to two independent ratings of the 
same test administration. Validity correlations consistently exceeded .90 
for both testing procedures. 



Implications and Recommendations for Further Study 

The results of this study demonstrate that more direct measures
of oral language proficiency may be as reliable as less direct but
more structured standardized tests. The logical assumption that direct
measures of oral language proficiency more accurately assess the skill
being measured (Clark, 1972a) therefore indicates an advantage in testing
by means of an interview. However, the high correlation of the MLA test
results (especially Part C) with the interview ratings, combined with
practical advantages in ease of administration offered by the MLA test,
may make it an acceptable alternative in some situations.

Convergent and discriminant validation of grammar, vocabulary, 
pronunciation, and fluency ratings within testing procedures indicates 
that these aspects of oral language proficiency can be defined and
measured reliably enough to provide a meaningful diagnostic profile of 
skills contributing to general oral proficiency. 

Continued research should be conducted on the construct validity of 
the language aspects of grammar, vocabulary, pronunciation, and fluency 
to determine whether rating these language aspects independently with 
an intervening lapse of time may reduce the correlations found between 
different language aspects rated by the same method. Research should 
also be undertaken to determine if the language aspects of grammar
and vocabulary may be more effectively tested with other assessment
procedures, such as written tests.







References 

Axelrod, Joseph. The Education of the Modern Foreign Language Teacher
for American Schools. New York: The Modern Language Association,
1966.

Brière, Eugene J. "Current Trends in Second Language Testing." TESOL
Quarterly 3 (1969): 333-40.

______. "Are We Really Measuring Proficiency with Our Foreign
Language Tests?" Foreign Language Annals 4 (1971): 385-91.

Bryan, Miriam M. "MLA Foreign Language Proficiency Tests for Teachers
and Advanced Students." The DFL Bulletin 5, i (1965): 4-7.

Buros, Oscar Krisen. The Seventh Mental Measurements Yearbook. Volume
II. Highland Park, N.J.: Gryphon Press, 1972.

Campbell, Donald T., and Fiske, Donald W. "Convergent and Discriminant
Validation by the Multitrait-Multimethod Matrix." In Principles
of Educational and Psychological Measurement, edited by William A.
Mehrens and Robert L. Ebel, pp. 273-302. Chicago: Rand McNally,
1967.

Carroll, John B. "Problems of Testing in Language Instruction: Some
Principles of Language Testing." In Report of the Fourth Annual
Round Table Meeting on Linguistics and Language Teaching, edited
by Archibald A. Hill, pp. 6-10. Monograph Series on Languages
and Linguistics. Washington, D.C.: Georgetown University Press,
1953.

______. "Foreign Language Proficiency Levels Attained by Language
Majors near Graduation from College." Foreign Language Annals
1 (1967a): 131-51.

______. The Foreign Language Attainments of Language Majors
in the Senior Year: A Survey Conducted in U.S. Colleges and
Universities. Cambridge, Mass.: Graduate School of Education,
Harvard University, 1967b. [EDRS: ED 013 343.]

______. "The Psychology of Language Testing." In Language
Testing Symposium: A Psycholinguistic Approach, edited by Alan
Davies, pp. 46-69. London: Oxford University Press, 1968.

Clark, John L. D. Foreign-Language Testing: Theory and Practice.
Philadelphia: The Center for Curriculum Development, 1972a.

______. "Measurement Implications of Recent Trends in Foreign
Language Teaching." In Foreign Language Education: A Reappraisal,
edited by Dale L. Lange and Charles J. James, pp. 219-57. The ACTFL
Review of Foreign Language Education, Volume 4. Skokie, Ill.:
National Textbook Company, 1972b.

______. "Theoretical and Technical Considerations in Oral Proficiency
Testing." In Testing Language Proficiency, edited by Randall
L. Jones and Bernard Spolsky, pp. 10-28. Arlington, Va.: Center for
Applied Linguistics, 1975.

Cooper, Robert L. "An Elaborated Language Testing Model." In Problems in
Foreign Language Testing, edited by John A. Upshur and Julia Fata,
pp. 57-72. Language Learning, Special Issue No. 3. Ann Arbor,
Mich.: Research Club in Language Learning, 1968.

Harris, David P. Testing English as a Second Language. New York:
McGraw-Hill, 1969.

Lado, Robert. Language Testing: The Construction and Use of Foreign
Language Tests. London: Longmans, Green and Co. Ltd., 1961.
Reprinted, New York: McGraw-Hill, 1965.

Lowe, Pardee, Jr. "Oral Proficiency Testing: How and Why?" Presented at
University of Minnesota German Department Roundtable, February 13,
1975.

______. "Oral Interview Applications, Problems and Research: A
Survey." Interview Testing Newsletter 1 (1976): 1-2.

Manual for Peace Corps Language Testers. Princeton, N.J.: Educational
Testing Service, n.d.

MLA Foreign Language Proficiency Tests for Teachers and Advanced Students.
Princeton, N.J.: Educational Testing Service, 1966. Now known as
MLA Cooperative Foreign Language Proficiency Tests.

MLA Interpretation of Scores. Leaflet. Princeton, N.J.: Educational
Testing Service, 1966.

Myers, Charles T., and Melton, Richard S. A Study of the Relationship
Between Scores on the MLA Foreign Language Proficiency Tests for
Teachers and Advanced Students and Ratings of Teacher Competence.
Princeton, N.J.: Educational Testing Service, 1964. (EDRS7eD~qTi
750T]

"Qualifications for Secondary School Teachers of Modern Foreign
Languages." Publications of the Modern Language Association of
America 70, iv (1955): 46-49.

Spolsky, Bernard. "Concluding Statement." In Testing Language
Proficiency, edited by Randall L. Jones and Bernard Spolsky, pp. 139-43.
Arlington, Va.: Center for Applied Linguistics, 1975a.

______. "Language Testing: The Problem of Validation." In Papers
on Language Testing 1967-1974, edited by Leslie Palmer and Bernard
Spolsky, pp. 146-53. Washington, D.C.: Teachers of English to
Speakers of Other Languages, 1975b.

Valette, Rebecca M. "Evaluation of Learning in a Second Language."
In Handbook on Formative and Summative Evaluation of Student
Learning, edited by Benjamin S. Bloom, J. Thomas Hastings, and George
F. Madaus, pp. 815-53. New York: McGraw-Hill, 1971.

Wilds, Claudia P. "The Oral Interview Test." In Testing Language
Proficiency, edited by Randall L. Jones and Bernard Spolsky, pp.
29-44. Arlington, Va.: Center for Applied Linguistics, 1975.



INTERVIEW TESTING RESEARCH AT 
EDUCATIONAL TESTING SERVICE 



John L. D. Clark 
Educational Testing Service 




Educational Testing Service has been involved in interview testing
activities for about the past nine years. The first and largest of
these activities is an ongoing project with the Peace Corps that began in
1969. During the first two years of the project, ETS language department
staff, following an initial period of intensive training at the Foreign
Service Institute, conducted a large number of interviews of Peace Corps
trainees and volunteers, both in the U.S. and at in-country duty stations.
For the past seven years, however, ETS collaboration with the Peace
Corps has focused on the training of in-country Peace Corps personnel
to conduct and rate interviews in the host country language, using an
English-medium training program described in greater detail elsewhere in
these proceedings.1 To date, approximately 560 interview testers
in 55 countries have been trained and certified under this program and
have administered some 18,000 interviews.

A second program in which ETS has been participating involves the 
training of interview testers in English and French at the secondary 
school level in cooperation with the New Brunswick (Canada) Ministry of 
Education. This project is also described in greater detail, and from 
the perspective of a "front-line" New Brunswick interviewer, in a separate 
presentation.2

One recent project, while of a smaller overall scale than either the
Peace Corps or the New Brunswick program, has permitted ETS to carry out
a number of research studies and analyses in the areas of interview
training, interview format, and scoring procedures that may be of interest
to others using the interview technique or involved in the interpretation
of interview results. This project derived from an interest on the part
of the TOEFL (Test of English as a Foreign Language) program at ETS in the
possibility of developing a test that could be used operationally within
the TOEFL program as a measure of active speaking ability. Although the
use of a direct, face-to-face interview would have been ideal from a
theoretical standpoint, the cost and administrative complexity of offering
this capability at each of the hundreds of TOEFL testing sites worldwide
dictated the development of a tape-recorded test supplemented by a printed
test booklet rather than a face-to-face test.

Even though a direct proficiency interview was not operationally
feasible within the TOEFL program, the research committee overseeing
the speaking test study recommended that a direct proficiency measure
be used as the criterion instrument against which the less direct testing
procedures could be compared and validated. It was further recommended



1See Lovelace paper, this volume.
2See Albert paper, this volume.



that, even before undertaking this portion of the study, the interview 
procedure itself be thoroughly investigated with respect to intra- and 
interrater reliability, the efficacy of the interviewer and rater training 
procedures, the effect of differing interview lengths, and related ques- 
tions. These activities were carried out between January and March 1977 
and produced the bulk of the experimental data reported here. Before 
presenting the study results, it will be useful to briefly describe 
the scope of the study and the specific procedures followed. 

The basic procedural approach of the TOEFL study was to carry out, 
"from scratch," each of the activities involved in: the initial training 
of interviewers; interviewing under realistic administration conditions; 
and, finally, interview rating, both on-the-spot and at a later time by
means of a tape recording made of each interview. 

A total of four prospective interviewers were identified from a
group of approximately twelve candidates, selection being made through
inspection of resumes followed by personal interviews. All four
interviewers were native speakers of English at the undergraduate or graduate
level and had an excellent technical knowledge of English through various
combinations of undergraduate and graduate level English study, graduate
linguistics courses, and ESL teaching experience.

The training process for the four interviewers was essentially
the same as for the Peace Corps and New Brunswick testers. Specifically,
each interviewer attended an intensive two-day session in which ETS staff
explained in detail the nature and operation of the interview and of
the interview scoring procedure. Demonstration interviews were also
conducted and critiqued as a group. During the late afternoon and evening
of the two training days, each participant listened to a series of
fifteen training tapes of interviews at score levels 0+ to 4+ to provide
additional familiarization and practice with the scoring scale. The final
step in the training process was to have each participant listen to and
rate a second, randomized series of fifteen interviews for which the
official score levels were not known in advance. For each trainee, the
extent to which the trainee's scores on all fifteen tapes corresponded with
the official levels was taken as a measure of rating accuracy.

Approximately three and a half weeks after the initial training
session, the four newly trained interviewers and the present writer
carried out a three-day session of interview testing at the American
Language Program (ALP) at Columbia University with a group of
undergraduate and graduate students taking ESL courses at the ALP. A total of
eighty-six students participated in the interviewing: forty-nine men and
thirty-seven women, ranging in age from seventeen to sixty-one (S.D.=8.57)
and representing twenty-six different languages.

The students were scheduled to appear for the interviewing over
a three-day period at thirty-minute intervals. On arrival at the testing
site, each student was asked to fill out a short questionnaire giving
basic identification information. In addition, the student was asked




to indicate his or her instructional level (present course placement
at the ALP) and to give a self-rating of speaking proficiency on the 0-5
basis, using the regular verbal descriptions of each score level. This
was accomplished by having the student read over each of the verbal
descriptions and place a check mark opposite the description that was
considered to best reflect his or her level of proficiency in spoken
English. The questionnaire and self-rating information were put aside
and were not seen by the interviewers at any point in the interviewing or
rating process.

In order to explore the psychometric properties of an interview of
appreciably shorter length than the usual (approximately twenty-minute)
interview, each student was asked to participate in both a regular-length
interview (hereafter, "long" interview) and a considerably abbreviated
("short") interview that was intended to run for a total of only five
minutes. The order of interviewing was such that approximately half
the students received the long interview first, followed immediately by
the short interview, and half the short interview, followed immediately
by the long. To avoid a carry-over or "halo" effect between long and
short interviews, different interviewers were used to conduct the long and
short interviews for a given student. Actual running times for the long
interviews ranged from 10'10" to 26'27", with a mean duration of 18'6" and
standard deviation of 3'43". The short interviews ranged in length from
4'20" to 8'54", with a mean of 6'33" and standard deviation of 1'5".

Over the three-day interviewing session, each interviewer conducted
both long and short interviews for approximately equal total periods of
time. Three interviewers began the session with long interviews and two
with short interviews to counterbalance any sequence-of-interviewing
effects across interviewers.

Both the long and short interviews were conducted on a
one-interviewer-per-student basis, with no observers or "second raters"
present. All interviews were cassette recorded, with small lapel
microphones worn by the interviewer and the examinee. Immediately following
the interview, the interviewer evaluated the examinee's performance, using
the regular verbal criteria (including "pluses" where applicable) and
noted this rating on the scoring form. However, the examinee was not
informed of the rating at that time and the rating was not communicated in
any way to the other interviewers.

The on-site interviewing sessions provided four basic types of
examinee data:

1. the examinee's course placement at the ALP;

2. self-rating of speaking proficiency;

3. on-the-spot interview rating based on a long interview format;

4. on-the-spot interview rating based on a short interview format.






In addition to the above data, ALP staff made available each
student's scores on a multiple-choice placement test administered by
the ALP on entry to the language training program. This test consisted
of a 60-item recorded listening comprehension section and a 120-item
section covering English grammar and vocabulary. The placement test
scores were not communicated to any of the interviewers until both on-site
interviewing and rerating of the recorded interviews had been completed.

Approximately two weeks after the on-site interviewing session at
ALP, each of the five interviewers listened to and rated all the
tape-recorded interviews, both long and short, including those he or she had
given. The tapes were sequenced in such a way that, for each rater,
approximately fifteen short interviews were followed by fifteen long
interviews, or vice versa, until the rerating was completed. In no
event were the long and short interviews for a given student listened
to back-to-back; they were in all instances separated by at least fifteen
intervening interviews. Discussions with the interviewers following
the rerating process indicated that the raters could not remember
individual examinees or the scores initially assigned, except for one or
two examinees at the highest and lowest extremes of the score scale whose
scores were remembered by the original rater because of the uniqueness of
the performance. For all practical purposes, however, and because of
the great number of interview tapes to be judged, the raters were not able
to recollect the initially assigned scores when rerating the interviews.

On completion of the rerating phase, four further types of
information were available to the study:

1. reratings of the regular long interviews by the original interviewer;

2. reratings of the short interviews by the original interviewer;

3. reratings of the long interviews by each of four additional raters;

4. reratings of the short interviews by each of four additional raters.

On the basis of the data obtained across the different phases of 
the study, it is possible to provide at least some empirically based 
information addressed to several different aspects of the interviewing and 
interview scoring process. To facilitate the presentation of results,
generalized topical headings applicable to interview testing and research
in a variety of contexts are used, followed by a description of study 
results bearing on that particular topic. 



Tester Performance during Training and In-field Rating Accuracy

As previously described, each of the four interviewers trained for
the TOEFL study was asked to rate a series of fifteen official test tapes,
ranging from 0+ to 4+, as a measure of end-of-training rating accuracy.
For each tape, the score given by the tester was compared to the official





score. Trainee scores a "plus" above or below the official score (for
example, an official 2 rated as a 2+ or a 4 rated as a 3+) received
a discrepancy weight of plus or minus 0.5. Any scores given by a trainee
that were one level above or below the official score received a
discrepancy weight of plus or minus 1.0. For each tester, the
discrepancies across all fifteen tapes were summed and both the absolute mean
values and the signed mean values (taking into account the direction
of the discrepancy as well as its magnitude) were determined, as shown in
Table 1.
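
The discrepancy weighting described above can be illustrated with a minimal Python sketch. It assumes that a "plus" counts as half a level, so that larger discrepancies scale in proportion to the two weights given in the text; the function names and the five sample ratings are hypothetical and are not taken from the study.

```python
from statistics import mean

def rating_to_number(rating: str) -> float:
    """Convert a rating such as '2' or '2+' to a number; a 'plus' adds 0.5."""
    rating = rating.strip()
    if rating.endswith("+"):
        return int(rating[:-1]) + 0.5
    return float(int(rating))

def training_discrepancies(trainee, official):
    """Return (absolute mean, signed mean) of trainee-minus-official discrepancies."""
    diffs = [rating_to_number(t) - rating_to_number(o)
             for t, o in zip(trainee, official)]
    absolute_mean = mean(abs(d) for d in diffs)
    signed_mean = mean(diffs)
    return absolute_mean, signed_mean

# Hypothetical ratings for five tapes (the study used fifteen per trainee).
official = ["0+", "1", "2", "2+", "4"]
trainee = ["1", "1", "2+", "2", "3+"]

absolute_mean, signed_mean = training_discrepancies(trainee, official)
print(absolute_mean, signed_mean)
```

In this made-up case the over- and under-ratings cancel in the signed mean while still contributing to the absolute mean, which is why the two figures are reported separately.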



TABLE 1

Comparison of Rater Training Accuracy with
Operational Scoring Accuracy*

             Absolute Mean            Absolute Mean Deviation
Rater     Training Discrepancy        in Operational Rating

  A              .40                          .22
  B              .50                          .40
  C              .10                          .24
  D              .30                          .30

                        r = .603 (n.s.)

             Signed Mean              Signed Mean Deviation
Rater     Training Discrepancy        in Operational Rating

  A              .00                          .10
  B             -.27                          .05
  C              .10                          .08
  D            -1.00                         -.01

                        r = .963 (p < .01)

*See text for definition of column entries.



As a measure of rating accuracy for each of the testers when working
in an operational setting some weeks after training, the average of
the ratings (across raters) given to each long interview during the
relistening phase of the study was calculated. For each rater, the
discrepancy of the rater's score from the average score for that interview
was obtained. For each rater, the discrepancies across all interviews
were summed and mean discrepancies, both absolute and signed, were
calculated (right-hand column of Table 1).






A correlation of .603 was found between the absolute mean training
discrepancy for a given rater and the corresponding absolute mean
deviation in operational rating performance. With the small sample size
(N=4), this correlation does not reach statistical significance. However,
for the signed mean discrepancies, the obtained correlation was .963,
significant at the p < .01 level and indicating a positive relationship
between this end-of-training variable and interview scoring accuracy.
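
As a check on these figures, the two correlations can be recomputed from the four-rater values in Table 1. The sketch below is illustrative only: the values are as read from the table (rater D's signed training discrepancy is read as -1.00), and the product-moment formula is written out by hand rather than taken from any package used in the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Values as read from Table 1, raters A through D.
abs_training = [0.40, 0.50, 0.10, 0.30]
abs_operational = [0.22, 0.40, 0.24, 0.30]
signed_training = [0.00, -0.27, 0.10, -1.00]   # rater D read as -1.00
signed_operational = [0.10, 0.05, 0.08, -0.01]

r_absolute = pearson_r(abs_training, abs_operational)
r_signed = pearson_r(signed_training, signed_operational)
print(round(r_absolute, 3), round(r_signed, 3))
```

Both results fall within rounding distance of the reported coefficients of .603 and .963.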

A caution in interpretation should, however, be noted. The elapsed 
time between initial training and operational scoring was relatively brief 
(approximately six weeks), and it is possible that the testers' scoring
performance over a longer time period might exhibit variations from the 
initial training profile that were not in evidence over the period of the 
study. However, even taking this consideration into account, the obtained 
results for the signed discrepancy analysis would appear to provide a 
reasonable degree of validation for the use of this end-of-training 
measure as an indicator of probable rating performance in the field. 



Intrarater Reliability 

The TOEFL study provided some information on the intrarater
reliability of the interview technique, that is, the extent to which
individual raters "agree with themselves" when rescoring interviews to
which they have earlier assigned ratings. Each of the five interviewers
had initially interviewed approximately seventeen students face-to-face
with the long interview format and approximately seventeen other students
with the short format. During the rerating phase of the study, each
interviewer listened to and rescored each of the interviews, long
and short, that he or she had conducted, as well as those of the other
interviewers. This activity provided intrarater reliability information
for each of the raters, as shown in Table 2.



TABLE 2

Score-Rescore Reliabilities of
Individual Raters

      Long Interview                Short Interview

Rater      r       N          Rater      r       N

  A      .907     17            A      .837     17
  B      .868     17            B      .904     14
  C      .947     19            C      .85?     15
  D      .771     17            D      .740     18
  E      .840     15            E      .751     11







For the long interviews, intrarater (or "score-rescore")
reliabilities of .771 to .947 were obtained, with an average reliability of
.867. Reliabilities for the short interviews were slightly lower, ranging
from .740 to .904, with an average reliability of .817. In all but one
instance, the short interview reliability for a given interviewer was
slightly lower than the long interview reliability; the single exception
was interviewer B, with long and short interview reliability figures of
.868 and .904, respectively.

The intrarater reliability data also provide some information on the
question of whether interview raters tend to evaluate examinee performance
differently depending on whether the rating is carried out on-the-spot or
is based on a tape recording of the interview that is listened to later.
For both long and short interviews, the mean scores of each rater for
both the initial (face-to-face) and subsequent (taped) ratings of those
examinees he or she had interviewed are shown in Table 3. Nonsignificant
differences in the mean scores for initial rating and rerating were found
for raters A, B, and C in the long interview situation and for raters A
and B in the short interview situation. However, for the long interviews,
raters D and E assigned significantly higher scores (p < .05) to the
rerated tapes than they had assigned during the face-to-face interviewing.
For the short interviews, raters D and E were joined by rater C, who also
gave significantly higher scores to the rerated tapes. Although the
mean scores for the other rater/interview combinations did not vary
significantly, in three of the four comparisons the numerical value of the
mean was higher for the reratings.

TABLE 3

Mean Initial Ratings and Reratings
Assigned by Individual Testers

                         Long Interview

              Initial Rating                 Rerating

Rater      Mean     S.D.     N          Mean     S.D.     N

  A        2.55     1.12    17          2.33     1.06    17
  B        2.94      .75    17          3.02      .49    17
  C        2.54      .00    19          2.68      .91    19
  D        2.07*     .73    17          2.63*     .63    17
  E        2.55*     .75    15          3.02*     .82    16

                         Short Interview

  A        2.25      .81    17          2.34      .73    17
  B    [illegible]  1.11    14          2.66      .75    14
  C        2.41*     .59    15          2.6?*     .84    15
  D        2.41*    1.04    18          2.09*     .67    18
  E        2.20*     .79    11          2.93*    1.04    11

*Initial and rerating means differ at p < .05.










From an operational standpoint, intrarater differences in scores
assigned to tape-based ratings and on-the-spot ratings would not be a
troublesome factor if the particular testing program utilized one of these
two types of scoring procedures exclusively, that is, if reported scores
involved only on-the-spot scoring or only tape-based scoring. However,
for programs in which reported scores can include both on-the-spot and
tape-based rating, it would appear desirable to carefully investigate
possible rater differences due to the type of scoring procedure and to
make allowance for any such differences in the use and interpretation of
interview results.




Interrater Reliability 

Interrater reliability refers to the extent to which two or more
raters agree with one another on the scores they assign to given
examinees. At ETS, data relating to the interrater reliability of the
interview procedure have been obtained both from the TOEFL study and in
connection with an interviewing program for Spanish-English bilingual and
English-second-language teachers and teacher certification candidates in
New Jersey.

In the TOEFL study, the five participating raters were asked to
listen to and score a series of taped English interviews they and their
four colleagues had conducted earlier on a face-to-face basis. A total of
eighty-six long interview tapes were scored by all five raters. However,
because of certain administrative problems in distributing the short
interview recordings to the raters, it was not possible in several
instances for all five raters to listen to and score a particular short
interview. Interviews for which even a single rating was missing were
removed from the analysis, leaving a total of sixty-eight short interviews
for which complete scoring data (scores from all five raters) were
available.

In the New Jersey study, four trained Spanish raters listened to and
scored a total of eighty-six Spanish interviews drawn from the pool of
interviews that had been conducted by the time of the study. For all
three sets of data (long and short TOEFL English interviews and New Jersey
Spanish interviews), intercorrelations of the scores assigned by the
raters were calculated. These are shown in Table 4 together with the
(arithmetic) mean correlation for each of the three correlation tables.
As a general observation, it may be suggested that the obtained
correlations for both the TOEFL and New Jersey data are within the overall levels
of scoring reliability that would be expected for a nonobjective testing
format of this type. The correlations also indicate that in all three
scoring instances, the raters were able to rank the performance of the
examinees whose interviews they evaluated in much the same way.
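
The arithmetic mean correlation mentioned above is simply the average of the correlations for every distinct pair of raters. A small Python sketch follows, using the six New Jersey coefficients as read from Table 4; the rater labels J, K, L, and M are a reading of the printed table, and the function name is hypothetical.

```python
from itertools import combinations
from statistics import mean

# Pairwise interrater correlations for the New Jersey interviews,
# as read from Table 4 (rater labels are a reading of the source).
new_jersey = {
    ("J", "K"): 0.900,
    ("J", "L"): 0.775,
    ("J", "M"): 0.815,
    ("K", "L"): 0.893,
    ("K", "M"): 0.854,
    ("L", "M"): 0.813,
}

def mean_interrater_r(raters, pairwise):
    """Average the correlation over every distinct pair of raters."""
    return mean(pairwise[pair] for pair in combinations(raters, 2))

mean_r = mean_interrater_r(["J", "K", "L", "M"], new_jersey)
print(round(mean_r, 3))
```

With four raters there are six pairs, and the average agrees with the mean r of .842 reported for the New Jersey data.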






TABLE 4

Interrater Correlations for Three Sets of Recorded Interviews

TOEFL Interviews, Long (N=86)

Rater        A          B          C          D          E

  A        1.000
  B     [illegible]   1.000
  C         .602       .705      1.000
  D         .780       .788       .71?      1.000
  E         .814       .804       .593       .711      1.000

                                                  Mean r = .735

TOEFL Interviews, Short (N=68)

Rater        A          B          C          D          E

  A        1.000
  B         .857      1.000
  C         .778       .741      1.000
  D         .771       .767       .744      1.000
  E         .752       .782       .67?       .709      1.000

                                                  Mean r = .758

New Jersey Interviews (N=86)

Rater        J          K          L          M

  J        1.000
  K         .900      1.000
  L         .775       .893      1.000
  M         .815       .854       .813      1.000

                                                  Mean r = .842



Although the correlation coefficients in Table 4 show a generally
high correspondence of score rankings, they do not take into account
possible absolute differences in assigned scores, that is, any tendency of
individual raters to score a given examinee performance more leniently or
more severely than their colleagues, even though they are in agreement
on the relative rankings of the examinees. The question of possible
differences in absolute scores was investigated by comparing the mean
score ratings (across examinees) assigned by the raters in all three
rating contexts; these results are shown in Table 5.






TABLE 5

Mean Interview Ratings for Individual Raters*

TOEFL Interviews, Long (N=86)

Rater      Mean Rating      S.D.

  A           2.47           .82
  E           2.67           .82
  C           2.74           .89
  D           2.77           .64
  B           2.79           .70

TOEFL Interviews, Short (N=68)

  A           2.41           .83
  E           2.48           .85
  C           2.54           .90
  B           2.72           .63
  D           2.76           .60

New Jersey Interviews (N=86)

  L           3.70           .93
  J           3.72          1.19
  K           3.97          1.10
  M           4.27           .80

*Raters sharing a common vertical line do not differ significantly in
mean score (p > .05). Raters not joined by a line differ beyond p=.05.



For the long TOEFL interview ratings, the raters' mean scores ranged
from 2.47 for the most severe rater to 2.79 for the most lenient. Ranges
for the TOEFL short interview and for the New Jersey interview ratings
were 2.41-2.76 and 3.70-4.27, respectively. The statistical significance
of the difference in means between individual raters was determined
through a series of t-tests for correlated means. The results of these
tests are shown in Table 5 by means of vertical lines. Raters sharing a
vertical line were not found to differ significantly in mean assigned
ratings, while significant differences were obtained between raters not
sharing a line.
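
A t-test for correlated means compares two raters on the same set of interviews by working with the per-interview score differences. The following is a minimal sketch of that computation under the usual paired-samples formula; the six pairs of ratings are invented for illustration, and the study's own computational details are not given in the text.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """t statistic and degrees of freedom for two raters' scores on the same interviews."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)   # stdev uses the n - 1 denominator
    return mean(diffs) / standard_error, n - 1

# Hypothetical ratings by two raters of the same six interviews.
rater_x = [2.0, 2.5, 3.0, 1.5, 3.5, 2.0]
rater_y = [2.5, 2.5, 3.5, 2.0, 3.5, 2.5]

t_value, df = paired_t(rater_x, rater_y)
print(round(t_value, 3), df)
```

The resulting t would then be compared with the tabled critical value for the given degrees of freedom at the chosen significance level.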






Although these comparisons do show a number of statistically
significant differences in the averages of the assigned ratings across
raters, they do not of themselves provide a very useful or practical
indication of the effect that scoring variability would be expected to
have on the interview scores reported for individual examinees. This can
be more readily determined by analyzing, for each examinee in a given
scoring study, the interview ratings actually assigned by the raters and
presenting this information in the form of expectancy tables showing the
probability that an examinee whose reported score is at a given level
would have a different scoring outcome if his or her performance had been
evaluated by some other rater.

This approach is demonstrated in Table 6 for the New Jersey interview
study. For each of three possible "passing score" levels shown in the
table, observed frequencies and percentages of the same or different
decisional outcomes are given. For example, if the passing score level is
hypothetically set at 2+ (i.e., if all examinees scoring 2+ or higher are
considered accepted and all those scoring below 2+ considered rejected),
the middle of the three expectancy tables in Table 6 would be consulted.
From these figures, based on the observed scoring performance of three
additional raters beyond the initial rater, it can be seen that 82.6
percent of the additionally generated scores for examinees initially rated
at level 2+ or higher were also 2+ or higher, and that 6.2 percent of the
additional scores for examinees initially rated below level 2+ were also
lower than 2+. By adding these two percentages (the upper left and lower
right quadrants of the table), it may be seen that 88.8 percent of the
reratings corroborated the initial decisional outcome as to acceptance or
rejection at a level 2+ cutoff.
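
The arithmetic behind an expectancy table can be sketched as follows. The
four counts are those of the 2+ cutoff in Table 6 (258 reratings, that is,
86 examinees each rescored by three additional raters); the upper row is
partly illegible in this copy and is taken here as 213 and 0, an assumption
that reproduces the printed 82.6 percent. The function name and cell labels
are illustrative only.

```python
def expectancy_percentages(pass_pass, pass_fail, fail_pass, fail_fail):
    """Convert the four cell counts of an expectancy table to percentages.

    Each argument is a (reported outcome, other raters' outcome) count.
    """
    total = pass_pass + pass_fail + fail_pass + fail_fail
    cells = {
        "pass/pass": pass_pass,
        "pass/fail": pass_fail,
        "fail/pass": fail_pass,
        "fail/fail": fail_fail,
    }
    percents = {name: 100.0 * count / total for name, count in cells.items()}
    # Agreement is the main diagonal: both ratings fall on the same side
    # of the cutoff.
    agreement = percents["pass/pass"] + percents["fail/fail"]
    return percents, agreement


# Counts for the 2+ cutoff; the upper row (213, 0) is assumed, see above.
percents, agreement = expectancy_percentages(213, 0, 29, 16)
for name, value in percents.items():
    print(f"{name}: {value:.1f}%")
print(f"agreement: {agreement:.1f}%")
```

The same function applies to the other two cutoffs by substituting their
cell counts.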

Percentages on the opposite diagonal indicate the proportion of
rescorings in which the original outcome was not duplicated. Specifically,
11.2 percent of the reratings for interviews originally scored
lower than 2+ were 2+ or higher, indicating that, in these instances,
there was an 11.2 percent probability that the candidate would have had a
favorable ("pass") outcome if he or she had been rated by another rater.
Persons responsible for setting "passing" levels or making other kinds of
decisions on the basis of the interview scores should take the nature
and extent of scoring variability into account: in the example shown,
consideration might be given to setting the passing score slightly lower
than the initially intended level, to minimize the possibility that
examinees who do in fact have the desired level of proficiency would be
improperly rejected as a result of scoring variability of the interview
process.



Relationship of Interview Scores to Other Indices of Language Competence

In addition to long and short interview scores for each examinee,
available TOEFL project data included information on the instructional
level of the English course to which the examinee had been assigned at the
ALP, performance on the ALP placement test, and self-rating of speaking
proficiency based on the regular interview scale.






TABLE 6

Expectancy Tables for Three Passing Score Levels
(New Jersey Data)


Passing Score: 3 or Higher

                    A. Number of Scores           B. Percent of Scores
                    Other Raters' Scores          Other Raters' Scores
Reported Scores     3 or higher  Lower than 3     3 or higher  Lower than 3

3 or higher             177            3             68.6%          1.2%
Lower than 3             45           33             17.4%         12.8%

Percent Agreement = 81.4%


Passing Score: 2+ or Higher

                    A. Number of Scores           B. Percent of Scores
                    Other Raters' Scores          Other Raters' Scores
Reported Scores     2+ or higher Lower than 2+    2+ or higher Lower than 2+

2+ or higher            213            0             82.6%          0.0%
Lower than 2+            29           16             11.2%          6.2%

Percent Agreement = 88.8%


Passing Score: 2 or Higher

                    A. Number of Scores           B. Percent of Scores
                    Other Raters' Scores          Other Raters' Scores
Reported Scores     2 or higher  Lower than 2     2 or higher  Lower than 2

2 or higher             255            0             98.8%          0.0%
Lower than 2              0            3              0.0%          1.2%

Percent Agreement = 100.0%






TABLE 7

Correlations of Long and Short Interview Scores with
Other Indices of Language Competence

                                          (1)     (2)     (3)     (4)     (5)

1. Instructional Level at ALP            1.000    .590    .558    .610    .551
2. ALP Placement Test Score               .590   1.000    .348    .570    .707
3. Self-Rating of Speaking Proficiency    .558    .348   1.000    .479    .430
4. Long Interview Score                   .610    .570    .479   1.000    .696
5. Short Interview Score                  .551    .707    .430    .696   1.000



The correlation matrix for all five of these variables is shown
in Table 7. The lowest of these correlations (.348) is significantly
different from zero (p < .01), and the highest correlations are significant
well beyond the .001 level. Although the greatest evidence for the validity
of the interview technique as a measure of real-life speaking proficiency
is considered to reside in the face and content validity of the procedure
and the associated scoring scale, intercorrelations of the obtained
interview scores with other kinds of language proficiency measures can
provide some corroborating evidence.

With respect to the self-rating data, correlations of .479 and .430
for the long and short interviews, respectively, were found between the
interview score results and student self-ratings of speaking ability using
the regular FSI scale. Although these correlations are not extremely
high, they suggest a clear positive relationship whose real magnitude is
probably underrepresented to some extent as a function of measurement
imprecision in both variables. Measurement precision of the student
self-ratings could probably have been increased by allowing the students
to indicate "plus" ratings where applicable, rather than rating on only
the five broad numerical categories. In addition, simplification of
and/or more detailed explanation of the meaning of each score category
would probably have been helpful, especially for the less competent
students, who may have encountered some difficulty in reading the verbal
definitions of proficiency with full comprehension.






Although this approach is not possible in operational interviewing
situations, a more precise estimate of the "true" interview scores for
individual examinees in the TOEFL study may be obtained by averaging
the five scores assigned by the interview raters when relistening to a
given interview. Intercorrelations of long and short average interview
scores with the self-ratings were found to be .560 and .554, an increase
over the .479 and .430 correlations with the single interview rating, and
presumably more indicative of the true extent of the relationship between
the two variables after adjusting for the scoring unreliability of the
interview.

Further experimentation with student self-ratings as related to
obtained interview scores would provide extremely useful information about
both the basic validity of the proficiency interviewing technique and the
extent to which self-ratings of competence might in certain situations
take the place of an externally administered interview. A major caution
in this regard is that the examinee should be in a position to give a
frank and honest appraisal of his or her level of proficiency. For
situations in which it would be to the candidate's advantage to profess a
higher (or lower) degree of competence than is actually the case, the
self-rating technique would be of questionable validity and usefulness.

Another question of interest in the correlational data is the
extent to which interview ratings might be used in place of typical
multiple-choice testing procedures for instructional placement purposes.
As shown in Table 7, the ALP placement test (consisting of 60 listening
comprehension questions and 120 questions bearing on English grammar
and vocabulary) correlated .590 with the instructional (class assignment)
levels of the examinees at the time of the interviewing study.
Correlations of .610 and .551 were found between the assigned instructional
level and the long and short on-the-spot interview ratings. The three
correlations do not differ significantly, indicating that both the long
and the short interviews were able to predict assignment to instructional
level as effectively as the multiple-choice placement test. Proponents
of the interview technique might point out that even a quite abbreviated
face-to-face interview lasting on the average only about six and a half
minutes showed as much predictive power as the considerably longer
and more time-consuming regular placement test. Proponents of more
objective testing techniques might consider these results indicative of
the extent to which testing procedures that do not require active speaking
performance can substitute for direct measures in an operational placement
context.

Length of Interview

The FSI-type interview is generally considered to require approximately
twenty minutes of testing time for the majority of examinees and
thirty minutes or more for examinees at the higher proficiency levels.
Including the time required to greet the examinee at the beginning of
the interview and to determine and record the interview rating following
the interview, the overall testing time can be expected to work out to
about thirty minutes per examinee, or no more than two examinees per hour.
In light of the time and manpower requirements for interviews of the
conventional length, there would be considerable practical value in
reducing the total testing time per interview, provided this could be done
without unduly affecting the face/content validity of the process or
appreciably lowering the scoring reliability.

With respect to scoring reliability, data from the TOEFL study
comparing both intrarater reliability (Table 2) and interrater reliability
(Table 4) of regular length and considerably shorter interviews
demonstrated little if any reduction in the reliability coefficients for
the abbreviated interview format. As additional evidence, based on the mean
interview rating across five raters, there was a correlation of .939
between the long and short interview scores for the TOEFL examinees,
indicating a very high degree of underlying correspondence in the two
variables. Further analyses are planned to determine the possible
existence of interaction effects between score levels and scoring
reliability, for example, the possibility that short interview scores are
less reliably related to long interview scores at the upper end of the
scoring scale than they are in the lower and middle ranges of the scale,
where judgments based on a less extensive speech sample are presumably
easier to make. Pending the detailed results of these analyses, the
overall correlations obtained between long and short interviews would
suggest that, at least from the standpoint of scoring reliability,
interviews based on appreciably shorter running times merit serious
practical attention.

With regard to the face/content validity of shorter-than-normal
interviews (and including the psychological reactions of both interviewers
and examinees to the reduced testing period), the TOEFL study interviews
of approximately six and a half minutes average duration may be subject
to discussion. Discounting the first half minute or so of both the long
and short interviews, which is necessarily (and desirably) spent in
greeting the examinee and exchanging a pleasantry or two, only about
six minutes on the average were available under the short interview format
for the interviewer to accomplish all the presumed necessary analytical
tasks of the interview, that is, to establish the examinee's level of
grammatical control, including tenses, agreements, and use of complex
structures; extent of vocabulary as manifested in a variety of topical
areas; and accuracy of pronunciation, overall fluency, and level of
listening comprehension. Over the three-day interviewing period, many
interviewers commented that, in the short interview situation, they would
have liked to have had a bit more time with a number of the examinees and
to have been able to ask a "few more questions" in order to make what
they considered an adequate and confident judgment of the examinees'
proficiency levels.

From the point of view of the examinee (in an other-than-experimental
setting), an interview lasting no more than five to seven minutes might be
viewed as inappropriately and unfairly short. Even though an accurate
rating might indeed be possible in this length of time, the examinee could
feel somewhat shortchanged in the conversational transaction and hence
insufficiently probed as to overall proficiency.

An approach that would appear to maintain much of the practical and
economic advantage of a short interview and at the same time provide for
greater interviewer and examinee satisfaction in the length and scope of
the procedure (as well as more fully support the face/content validity of
the interview process) would be to make use of a medium-length interview
of perhaps ten to twelve minutes, to be used with all but the most highly
proficient examinees. Within this time period, and assuming that
conversational digressions and overly long exploration of individual topical
areas were kept to a minimum, the interviewer should be able to obtain a
sufficiently extensive language sample to make an accurate rating and at
the same time carry out a sufficiently wide-ranging conversation to
satisfy the affective expectations of the process.

If procedures could be developed to carry out the entire interviewing
and rating sequence for a majority of examinees within a fifteen-minute
rather than a thirty-minute period, the total testing time for large
numbers of examinees would be effectively halved, with concomitant savings
in manpower and testing costs. For situations in which total testing
time is not a significant concern (as, for example, in relatively
low-volume testing carried out on an as-needed basis by regular members of
an institutional staff), twenty-minute or longer interviews could of
course be utilized and justified on both measurement and economic grounds.
In other situations involving large numbers of examinees, outside
interviewers, or other significant time/cost factors, a shorter interview
format optimizing both validity/reliability and manpower/cost factors
would merit serious consideration. Present indications from available ETS
data are that a considerable reduction in total interviewing time should
be possible without adversely affecting the scoring reliability or
linguistic integrity of the process.
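
The time savings discussed above reduce to simple arithmetic, sketched
here for the thirty-minute and fifteen-minute totals mentioned in the
text. The caseload of 200 examinees is a hypothetical figure chosen only
for illustration, as are the function names.

```python
def examinees_per_hour(minutes_per_examinee):
    """Number of examinees one interviewer can handle in an hour."""
    return 60.0 / minutes_per_examinee


def interviewer_hours(n_examinees, minutes_per_examinee):
    """Total interviewer time, in hours, for a group of examinees."""
    return n_examinees * minutes_per_examinee / 60.0


# Hypothetical caseload; 30 and 15 minutes are the per-examinee totals
# discussed in the text.
caseload = 200
for minutes in (30, 15):
    print(
        f"{minutes} min each: {examinees_per_hour(minutes):.0f} per hour, "
        f"{interviewer_hours(caseload, minutes):.0f} interviewer-hours"
    )
```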






PSYCHOPHYSICAL SCALING OF THE
LANGUAGE PROFICIENCY INTERVIEW:

A PRELIMINARY REPORT



Robert J. Vincent 
Central Intelligence Agency 






Background

Few language teachers or researchers would be expected to argue with
the statement that for a given foreign language, a beginning student would
experience more difficulty achieving a 3+ level on the eleven-point
Foreign Service Institute (FSI) speaking proficiency scale than he would
in reaching, say, the 2 level. But would the same teachers or researchers
agree if asked to judge how much more difficult the 3+ level is to achieve
than is the 2 level?

How much consensus would there be to a more complicated set of 
questions? Which is more difficult to achieve, and how much more: 
reaching a 3 level from a 0+, or a 4 from a 3+? How long should the 
average student in each category be enrolled in training? Is it possible 
to project from known durations of training to situations where, as yet, 
no data exist? 

These and similar kinds of questions have cropped up time and again
during a series of joint research efforts by the Psychological Services
Staff (PSS) and the Language School (LS) to predict the speaking efficiency
of language students at the conclusion of training. To be perfectly
candid, we are rather proud of our ability to prognosticate on the basis
of selected linguistic and psychological variables. Yet one thing we
have learned along the way: the only two variables common to all of the
languages investigated thus far are duration of training and speaking
proficiency at the outset of training.

These recurring findings, coupled with the thought provocations just
advanced, have led to a search for a unitized measure or scale of the
difficulty of learning a foreign language.



"The author wishes to recognize the unusual measure of support given 
this research by the judges and Language School management personnel. 
They gave generously of their time, patience, enthusiasm, and expertise 
(despite certain misgivings about the stability of the author for having 
them throw numbers about in such an unorthodox fashion). A special 
word of thanks is due Dr. Pardee Lowe, Chief of Testing of the Language 
School. Without his complete cooperation, which by now the author
very much takes for granted, neither this nor any of the other joint
research projects conducted over the past several years would have come
to fruition.






The need for scales traces its ancestry to the laboratories of German
and French physicists. While it is true that the ancient Greeks sought
laws relating the responses of man to the world around him, the Europeans
made the first significant breakthroughs in relating sensory attributes
such as loudness and brightness to their corresponding physical
attributes: dynes per cm² and lamberts. These endeavors evolved into a
branch of psychology referred to as psychophysics. Stevens (1936),
considered by many to be the father of modern psychophysics, embarked on a
vigorous, forty-year program to scale a variety of sensory continua. A
prolific and at times irascible spokesman, his initial efforts were
generated by a commercial requirement for a scale of subjective loudness.
The physical (decibel) scale did not behave at all like its psychophysical
(loudness) counterpart; simply put, 50 db does not sound half as loud as
100 db. Hence, the communication engineer needed a scale whose numbers
made more sense to his customers than did the numbers on the decibel
scale. The result was the sone scale (Stevens, 1955), which was
subsequently adopted by the International Standards Organization to describe
loudness for engineering purposes.

Psychophysicists were content to occupy themselves with true sensory
problems until the mid 1950s. By that time they had reached general (but
by no means universal) agreement on a psychophysical law: for nearly
three dozen sense modalities (such as loudness, brightness, taste,
heaviness, judged intensity of electric shock, and so forth), equal stimulus
ratios produce equal perceptual ratios. Expressed mathematically:

$$\psi = k(\phi - \phi_0)^n$$

where the perceived magnitude $\psi$ grows as the physical scale $\phi$
raised to a power $n$. The $\phi_0$ is often thought of as a threshold,
while $k$ is merely a constant that depends upon the units employed. One
particularly useful feature of this law is that when log $\psi$ is plotted
against log $(\phi - \phi_0)$ the resulting power function is a straight
line. Most importantly, each of the modalities abiding by the law seems to
have a characteristic exponent ($n$), ranging from 0.3 for brightness to
3.5 for apparent intensity of electric shock (Stevens, 1961).
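
The straight-line property of a power law on log-log axes can be checked
numerically. The sketch below generates responses from a power function of
the form magnitude = k(stimulus - threshold)^n and recovers the exponent as
the least-squares slope of log response on log stimulus. The exponent 0.3
is the brightness figure cited above; k = 1, a zero threshold, and the
stimulus values are arbitrary assumptions made for illustration.

```python
import math


def power_law(phi, k, phi0, n):
    """Perceived magnitude psi = k * (phi - phi0) ** n."""
    return k * (phi - phi0) ** n


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) on log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


# Illustrative values only: exponent 0.3, with k = 1 and a zero threshold
# chosen for convenience.
k, phi0, n = 1.0, 0.0, 0.3
stimuli = [1, 10, 100, 1000, 10000]
responses = [power_law(s, k, phi0, n) for s in stimuli]
slope = loglog_slope([s - phi0 for s in stimuli], responses)
print(f"recovered exponent: {slope:.3f}")
```

Because the stimulus is measured from the threshold before taking logs,
the slope of the fitted line equals the exponent of the power function.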

In the late 1950s the psychophysical techniques that had been found
to work so well on measurable, physical (metric) continua began to be
applied to stimuli that could be described only on a nominal (nonmetric)
scale: attitudes, verbal statements, occupations, crimes, punishment, and
musical selections, to name just a few (Stevens, 1966). Interestingly
enough, the psychophysical power law seems to have held. Without some
sort of metric, of course, the law could not be directly confirmed, but in
the several instances where corresponding metrics were subsequently
scaled, the relationship between judgments and physics entailed a power
law.

Given this background, it seemed worthwhile to bring the psycho- 
physical tools to bear on the matter of scaling the difficulty of learning 
a foreign language. This paper summarizes the extent to which this goal 
has been achieved. 






Method 

Eighteen faculty members of the LS volunteered to participate in the
research. Each was asked to judge the difficulty the "average" LS student
experiences in achieving the various speaking proficiency levels of the
eleven-point FSI scale. The specific methods by which they went about
this task are discussed in the next section. Suffice it to say at
this point that judgments were restricted to the single foreign language
the rater considered to be his area of prime expertise. The language
categories included French, Spanish, German, Russian, Chinese (Mandarin),
Japanese, Swedish, Arabic, Turkish, Portuguese (Brazilian), and
Indonesian. Results from four participants were excluded from the analysis
because the judges did not fully comply with the instructions, or because
they were unable to complete the task due to prior commitments.

Two methods for judging the difficulty of learning to speak foreign 
languages were employed in the study. Copies of the instructions and 
response forms may be found in Appendix B. 

Phase 1--Magnitude Estimation. The most direct and perhaps most
efficient method to obtain an estimate of the relation between the FSI
scale and judged difficulty attendant with reaching a particular FSI level
is by means of magnitude estimation. The technique was employed as
follows: a list of all eleven FSI levels was presented to each judge.
Heading the list was a 2+ (the midpoint of the FSI scale), which was
referred to as the "standard." An arbitrary number of 10 was assigned
to it to describe its relative difficulty to achieve at the conclusion of
training. Each of the remaining ten comparison FSI levels (arrayed in a
different randomized order for each participant) was then judged by having
the participants decide what number should be assigned to describe its
difficulty to achieve relative to the 2+ standard. For example, if a
particular FSI level was judged to be three times more difficult than a
2+, it received a value of 30. If another level was considered only
one-tenth as difficult, it was called a 1, and so on.

The method of magnitude estimation was deliberately chosen as the
lead-off technique because it is relatively straightforward and usually
easily understood. Language School administrators had cautioned that some
participants could be expected to experience difficulty interpreting the
instructions because English was not their native language. As it turned
out, few participants voiced any concern whatsoever, and nearly all
completed Phase 1 in the allotted time of fifteen minutes. Several judges
did express reservations, noting that they disliked working with numbers
and that their results would be meaningless (a typical reaction in this
kind of research). Nonetheless, they were encouraged to try and, with
few exceptions, produced results entirely in keeping with those of the
remaining judges.






Phase 2--Ratio Estimation. A second psychophysical technique was
employed for several reasons. In the first place, despite the preliminary
nature of this research, some means for independent verification of the
results seemed to be in order. Second, the magnitude estimation method
was limited by virtue of the fact that, as it was employed in this study,
it focused on the "average" student's exit proficiency (that is, his
FSI rating at the conclusion of training). Since it did not directly
account for the fact that students can enter training at any FSI level
(enter proficiency), the judges were left with the following options:
either restrict their judgments to the case where enter proficiency was
assumed to be 0, or somehow mentally average across all possible enter
proficiencies to arrive at a single number appropriate to the exit
proficiency in question.

The method of ratio estimation solved both problems. If indeed
judged difficulty obeys the power law, both psychophysical techniques
should produce similar results, with one serving as a check on the other.
Moreover, the ratio estimation technique required the judges to assign
numbers to all possible combinations of pairs of enter and exit
proficiencies (excluding those cases where the enter proficiency scores
equaled or exceeded exit proficiency scores). An enter score of 1+ and an
exit score of 3 were chosen to represent the standard of 10. All remaining,
randomized pairings were then judged relative to the standard pair. The
judges were simply instructed to assign to the comparison pairs numbers
proportional to the relative difficulty of the standard pair. Whereas
magnitude estimation involved only exit proficiencies, ratio estimation
was concerned with pairs of proficiencies. Otherwise, the scaling
techniques were similar.
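
The size of the ratio estimation task follows from the structure of the
scale. The sketch below enumerates every enter/exit pair in which the enter
level lies below the exit level and sets aside the standard pair (enter 1+,
exit 3). The variable names are illustrative, and the level labels assume
the usual eleven-point ordering from 0 through 5.

```python
from itertools import combinations

# The eleven FSI speaking levels, in ascending order.
FSI_LEVELS = ["0", "0+", "1", "1+", "2", "2+", "3", "3+", "4", "4+", "5"]

# combinations() preserves list order, so every pair has enter below exit.
pairs = list(combinations(FSI_LEVELS, 2))
standard = ("1+", "3")
comparisons = [p for p in pairs if p != standard]

print(len(pairs), "enter/exit pairs in all")
print(len(comparisons), "pairs judged against the standard")
```

The much larger number of judgments, compared with the ten comparison
levels of Phase 1, is consistent with the longer time allotted to the task.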

For the record, the judges found the ratio estimations much more of a
challenge, and several took the opportunity to say so in no uncertain
terms. If their magnitude estimates were meaningless, they noted, their
ratio estimates had to be worse. As before, the experimenter attempted
to assuage their concerns and asked them to do their best. Although
most judges completed the task in the allotted forty-five minutes, some
required twice as much time.



Results and Discussion

The experiment was expressly designed so as not to constrain the
participants' definition of what constituted difficulty of learning to
speak a foreign language. As a case in point, no mention was ever made by
the experimenter that one way to assess the relative difficulty of the
various FSI proficiency levels would be to compare the average durations
of training associated with each combination of enter and exit proficiency
ratings. Indeed, both the formal instructions and the informal
introductory remarks stressed that difficulty was a judgment and that its
definition probably varied from person to person and language to language.
The experimenter expressed sympathy with how strange it must seem to be
asked to assign numbers to such a nebulous dimension. Interestingly
enough, not one participant volunteered that estimated duration of
training constituted the basis for his judgments of difficulty (although
that in no way discounts the possibility that duration was, in fact, the
basis).

In any event, when the judges' estimates of difficulty of achieving
each FSI level were compared to the average duration of training required
to achieve that level, the resulting functions offered surprisingly strong
confirmation of the psychophysical power law (Figure 1).² As a matter
of fact, judged difficulty was described by both psychophysical methods
as being directly proportional to the duration of training.³ In
mathematical terms,

$$\psi = k(\phi - \phi_0)^n$$

where $\psi$ refers to estimated difficulty, $k$ is a constant with a value
of .03 for magnitude estimation and .01 for ratio estimation, $\phi$ is
duration of training, $\phi_0$ is a constant with a value of 37 for
magnitude estimation and 0 for ratio estimation, and $n$ = 1.00 for
magnitude estimation and 1.03 for ratio estimation.

Adhering to standard procedures for handling highly variable data of
the type found in psychophysical studies (Stevens, 1960), geometric means
rather than arithmetic means were calculated for each enter and exit
proficiency combination. This was true for both the judges' estimates of
difficulty and the empirical durations of training.
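
The difference between the two kinds of mean can be seen in a small
example. The five estimates below are hypothetical, not the judges' data;
they are chosen to show how one very large estimate pulls the arithmetic
mean upward while affecting the geometric mean far less. The helper
function is an illustrative sketch.

```python
import math


def geometric_mean(values):
    """Geometric mean: the antilog of the mean of the logs."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical difficulty estimates from five judges for one FSI level.
estimates = [5, 10, 10, 20, 100]
print(f"arithmetic mean: {sum(estimates) / len(estimates):.1f}")
print(f"geometric mean:  {geometric_mean(estimates):.1f}")
```

Averaging in logarithms suits judgments made on a ratio basis, since a
judge who doubles every number shifts the log scale by a constant rather
than stretching it.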

Table 1 summarizes the duration of training data for the six
languages in the data base. It should be mentioned that for the higher
enter/exit combinations, few data points were available for use, and
inspection of Table 1 reveals that no data whatsoever existed for the
categories beyond 3+. Security considerations prevent disclosure of the
numbers of students or measures of the variability of the data falling
within each category.

Statistical procedures formulated by Ekman (1961), Mashhour (1961),
and Torgerson (1958) were followed in deriving the two psychophysical
scales. The power functions were calculated exclusively on the basis
of training duration data found in the 0 through 3+ FSI categories;
training durations associated with the 4, 4+, and 5 levels were then
projected on the basis of the resulting power functions (and shown as
filled delta points in Figure 1).



² Duration of training data were compiled from the PSS computerized data
base for LS students enrolled since 1969 in French, Spanish, German,
Russian, Chinese, and Japanese.

³ See Note 1, Appendix A.



[Figure 1. Estimated difficulty of acquiring foreign language speaking
proficiency, plotted against hours in training. General form: estimated
difficulty = k(hours - a)^n. Via magnitude estimation: .03(hours - 37)^1.00.
Via ratio estimation: .01(hours - 0)^1.03. Filled delta points are
projected data points.]






TABLE 1

Consecutive Weeks in Language Training*
(Empirical Data)

                                ENTER PROFICIENCY

EXIT
PROFICIENCY       0     0+      1     1+      2     2+      3     3+      4     4+

0+              2.4
1               6.7    2.7
1+             12.1    4.5    4.1
2              13.9   10.5    7.2    4.4
2+             17.7   26.2    9.4    6.1    3.9
3              16.5   16.6    9.9    9.7    9.4    4.5
3+             29.0   11.5     --     --   16.5    4.2    3.1
4
4+
5

*Based upon data available on French, Spanish, German, Russian, Chinese
(Mandarin), and Japanese training programs.



The specific difficulty scale derived from magnitude estimations was
found to be:

$$\text{Estimated Difficulty} = .03\,(\text{Hours in training} - 37\ \text{hours})^{1.00}$$

The comparable function for the ratio estimations was:

$$\text{Estimated Difficulty} = .01\,(\text{Hours in training})^{1.03}$$
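
The two fitted functions can be evaluated, and inverted, with a few lines
of code. The constants are those reported above (.03, 37 hours, and 1.00
for magnitude estimation; .01, 0, and 1.03 for ratio estimation). The
function names and the sample hour values are illustrative assumptions,
and the output is on each method's own arbitrary difficulty scale.

```python
def difficulty(hours, k, threshold, exponent):
    """Estimated difficulty = k * (hours - threshold) ** exponent."""
    return k * (hours - threshold) ** exponent


def hours_for(difficulty_value, k, threshold, exponent):
    """Invert the power function to get hours in training."""
    return (difficulty_value / k) ** (1.0 / exponent) + threshold


# Constants as reported for the two scaling methods.
MAGNITUDE = dict(k=0.03, threshold=37.0, exponent=1.00)
RATIO = dict(k=0.01, threshold=0.0, exponent=1.03)

# Sample hour values chosen only for illustration.
for hours in (100, 500, 1000):
    m = difficulty(hours, **MAGNITUDE)
    r = difficulty(hours, **RATIO)
    print(f"{hours:5d} hours: magnitude {m:6.2f}, ratio {r:6.2f}")

# Round trip: the inverse recovers the hours from the difficulty.
print(round(hours_for(difficulty(500, **RATIO), **RATIO), 6))
```

The inverse form is the one needed for projection: given a judged
difficulty for a level with no training data, it returns the implied
number of hours in training.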






Recall from an earlier discussion that the .03 and .01 values are 
simply constants that move the functions up and down the scale according 
to the units of measurement chosen by the judges. Beyond that, they are 
of little interest to the discussion at hand. 

The thirty-seven hour figure in the magnitude estimation function is
another constant, and is often thought of as a noise threshold in the pure
psychophysical studies (although even there its lineage occasionally is
indeterminate). Mathematically, it serves to straighten out an otherwise
curvilinear function. Whereas no such constant was required for the ratio
estimation data, inspection of Figure 1 reveals that the magnitude
estimation function would have been markedly curvilinear had not the constant
been taken into account. For present purposes this additive constant will
be viewed as a statistical expedient for curve fitting purposes, since the
areas of prime interest rest with the overall relationship of judged
difficulty to duration of training, and especially the slopes of these
linear relationships.

But we would be remiss not to point out (at least parenthetically) 
that the thirty-seven hour constant is nearly identical to the average 
number of hours spent in training by those LS students who entered at, but 
were unable to progress beyond, the 0 proficiency level. 

Note also in Figure 1 that the corresponding FSI ratings have been
plotted along the estimation axes. These results are interpreted as
follows: according to the magnitude estimation scale, an FSI level of 5
was judged to be about 85 times more difficult to achieve than a 0+, but
only twice as difficult as a 4. Looking over to the ratio estimations, a
5 was estimated to be about 240 times more difficult to reach than a 0+,
and more than 8 times more difficult than a 4. In other words, although
the overall relationship between judged difficulty and duration of
training obtained by the two procedures was described by nearly identical
power functions, the respective ranges of difficulty and the distribution
of FSI levels within each range differed according to the psychophysical
technique chosen. The differences between the two techniques are most
striking at the 4 and higher levels. The magnitude estimation scale
suggests that a student can achieve a 4 rating in approximately 3,250
hours (about 88 weeks), whereas the ratio estimation scale projects nearly
18,000 hours (or nearly 9.5 years). It is doubtful that many instructors
would be as optimistic as the magnitude estimation projection, and the
ratio estimation projection may be too low as well. But at least it
squares with the opinion of some linguists that language proficiency is
fairly well established by the age of ten (Chomsky, 1968).

In any event, results discussed thus far appear to have satisfied two 
of the goals set forth for this research: scaling difficulty of learning 
a foreign language, and relating this difficulty to duration of training. 






Carrying the analysis a step further, it was possible to use the
power functions to project the average number of hours in training for
every combination of enter and exit proficiency. Two such projections
have been made. The left-hand and center scales of Figure 2 show once
again the relationship of FSI levels to estimated difficulty. These
results came from the Phase 2 (ratio estimation) study and are identical
to those depicted on the right side of Figure 1. Magnitude estimation
data could have been used as well, but they were not, owing to the
abbreviated range of judgments and the fact that, as mentioned earlier,
such estimates were based upon overall estimates of the difficulty of exit
proficiency rather than upon pairs of enter and exit proficiencies.

The right-hand scale is an artificial difficulty scale specifically
calculated to even out the differences found among the various FSI levels
on the ratio estimation scale. For example, the original scale (left
side) indicates that the difference between a 4+ and a 5 is considerably
larger than, say, the difference between a 4 and a 4+, despite the fact
that the results are already plotted on a logarithmic scale (which would
guarantee that even if the three levels had fallen equidistantly from one
another, the relative difficulties would increase logarithmically). Four
possibilities can be thought of as accounting for these
disproportionalities: (1) the difficulty estimates are accurate--a 4+ is
in reality very much less difficult to reach than is a 5, but only
moderately more difficult than is a 4; (2) the judges had trouble
estimating the difficulty of the levels, especially the mid- and
upper-range levels; (3) the variations among levels could reflect how
appropriately the judges regarded and were able to use numbers and ratios;
or (4) some combination of these factors was at work.

While the last (or compromise) hypothesis probably covers all the
bases, the second hypothesis more than likely focuses on the single
most significant contributor to response variability. Nearly all judges
remarked that they had never trained an adult student beyond the 4 or 4+
level (in some cases, beyond a 3+), and therefore could not imagine how
difficult a task it would be, assuming that it were at all possible.

Although there is no a priori basis for accepting either scale as it
applies to the higher FSI levels (recall that no data existed in our
computerized records for the 4 through 5 levels), each scale could be
compared to the empirical data base in the 0 through 3+ levels (Table
1). Such a comparison presumes acceptance of the data base as representing
learning to speak a foreign language in general, despite the fact that
some data points were based upon very small numbers of students (who
themselves may or may not have been representative of students in
general). In addition, all data points reflect most heavily the influence
of students of French and Spanish, less heavily German, Russian, Chinese,
and Japanese, but no other languages. About all that can be said in
defense of the data base is that it represented the totality of
information on duration of training available at the time this study was
conducted. To the extent that it does adequately reflect how long the
average student spends in training, it can be expected to provide useful
results.

FIGURE 2
Original and Revised Judged Difficulty Scales
(see text)

[Figure not reproduced. FSI levels 0+ through 5 are plotted against the
Original Judged Difficulty Scale (left), estimated difficulty (center),
and the Revised Judged Difficulty Scale (right).]

To this end, projected durations of training for each combination of
enter and exit speaking proficiencies were compiled according to the ratio
estimation power function. The results are displayed in Table 2, with
the original scale results posted at the top and the revised (log
equidistant scale) results at the bottom. A comparison of these results
with the empirical data in Table 1 is summarized in Figure 3. With a few
rather conspicuous exceptions (such as 0+ to 2+ and 0+ to 3+), the judges'
original estimates were reasonably accurate reflections of the actual
durations of training in each enter/exit category. On the average, the
original scale overestimated duration of training up to the 3+ level by
approximately 1.3 weeks, whereas the revised scale overestimated training
duration by more than 17 weeks. In short, the results support the
contention that FSI levels are not spaced equidistantly along a
logarithmic scale. Some levels are very much more or less difficult to
achieve than would be predicted by a linear or logarithmic projection.

Finally, in answer to the question posed earlier (Which is more
difficult to achieve, and how much more: reaching a 3 level from a
0+, or a 4 from a 3+?), note once again the top portion of Table 2. The
projected duration for the former case is twenty-two weeks, compared to
thirty-four weeks for the latter. Thus, progressing from a 3+ to a 4 is
projected to take 1.5 times longer than advancing from a 0+ to a 3.
The empirical training data (Table 1) led to a dead end, since no 4+ data
are cited. However, some last-minute detective work uncovered the records
of several students who satisfied the 3+/4 requirement. Their average
duration of training was, surprisingly, only 18.5 weeks, resulting in a
1.1 to 1 ratio for the empirical data. In either case, 3+ to 4 shows
every indication of being more difficult to achieve than a 0+ to 3.
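
The two ratios quoted above can be checked with a few lines of arithmetic. In this sketch the projected figures (34 and 22 weeks) and the 18.5-week figure come from the text. The 16.6-week empirical duration for 0+ to 3 is an assumption read from Table 1, and the function names are hypothetical. The last function applies the power-law exponent of 1.03 to turn a duration ratio into a judged-difficulty ratio.

```python
N = 1.03  # exponent of the ratio estimation power function


def duration_ratio(weeks_a, weeks_b):
    """How many times longer pairing A takes than pairing B."""
    return weeks_a / weeks_b


def difficulty_ratio(weeks_a, weeks_b, n=N):
    """Judged-difficulty ratio implied by the power law for two durations."""
    return (weeks_a / weeks_b) ** n


# Projected durations (Table 2, original scale): 3+ to 4 vs. 0+ to 3.
projected = duration_ratio(34.0, 22.0)

# Empirical durations: 18.5 weeks for 3+ to 4 (from the text);
# 16.6 weeks for 0+ to 3 is an assumed reading of Table 1.
empirical = duration_ratio(18.5, 16.6)

if __name__ == "__main__":
    print(round(projected, 2), round(empirical, 2))
    print(round(difficulty_ratio(34.0, 22.0), 2))
```

Because the exponent is so close to 1, the difficulty ratio is only slightly larger than the duration ratio.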



See Note 2, Appendix A. 
See Note 3, Appendix A. 






TABLE 2

Projected Consecutive Weeks in Language Training*
(Based Upon Judged Difficulty = .01 [Hours]^1.03)**

A. Based upon Original Judged Difficulty Scale

                                 ENTER PROFICIENCY
EXIT         0      0+       1      1+       2      2+       3      3+       4      4+
0+         2.4
1          4.6     2.2
1+         7.0     4.6     2.4
2         10.8     8.4     6.2     3.8
2+        13.9    11.6     9.4     7.0     3.2
3     [illeg.]    22.0    19.8    17.4    13.6    10.4
3+        28.8    26.5    24.2    21.9    18.1    14.9     4.5
4         62.9    60.5    58.3    55.9    52.1    48.9    38.5    34.0
4+       111.4   109.0   106.8   104.4   100.6    97.4    87.0    82.5    48.5
5        484.6   482.2   480.0   477.6   473.6   470.6   460.2   455.7   421.7   373.2

B. Based Upon Revised Judged Difficulty Scale

                                 ENTER PROFICIENCY
EXIT         0      0+       1      1+       2      2+       3      3+       4      4+
0+         2.4
1          4.3 [illeg.]
1+         7.7     5.4     3.4
2         13.9    11.6     9.7     6.2
2+        25.2    22.8    20.9    17.5    11.2
3         45.5    43.1    41.2    37.8    31.6    20.3
3+        82.2    79.8    77.9    74.5    68.3    57.0    36.7
4        148.5   146.2   144.3   140.8   134.6   123.3   103.0    66.3
4+       268.3   266.0   264.1   260.6   254.4   243.2   222.8   186.1   119.8
5        484.8   482.5   480.6   477.1   470.9   459.6   439.3   402.6   336.3   216.5

*Based upon estimates of difficulty by instructors in French, Spanish,
German, Russian, Chinese (Mandarin), Japanese, Portuguese (Brazilian),
Swedish, Turkish, Arabic, Indonesian.

**Duration estimates based upon data available on French, Spanish, German,
Russian, Chinese (Mandarin), and Japanese training programs.




FIGURE 3

Deviation of Projected from Empirical Duration of Training
(for enter/exit pairings, 0 through 3+)

[Figure not reproduced. For each enter/exit pairing, the plot shows the
deviation in weeks of the projected from the empirical duration of
training, for the original and the revised ratio estimation scales. The
legend gives a mean deviation of +17.03 weeks for the revised scale.]




Conclusions and Recommendations 

Preliminary though they may be, the results rather strongly suggest
that judged difficulty in learning to speak a foreign language can be
scaled, and that difficulty is directly related to duration of training by
the psychophysical power law. Since the law permits one to state that
equal stimulus ratios produce equal perceptual ratios, it was possible to
apply the judges' estimates to projections beyond available data, thereby
generating a complete matrix of duration estimates for all pairs of enter
and exit speaking proficiencies. It was further concluded that the
estimated difficulty of achieving sequential FSI levels is not a
straightforward progression. Some levels, especially those beyond the 2
or 2+ level, seem to require disproportionate amounts of training.

In recognition of the preliminary nature of this study, it is
recommended that further work be pursued, with particular emphasis on
enlarging the data base to include a wider selection of languages; filling
in the gaps in the empirical duration-of-training data base; determining
if individual languages obey the power law and, if so, grouping them
according to their relative judged difficulty (and comparing the resulting
groupings with those currently available); and, finally, calculating the
judged difficulty of learning to read and understand foreign languages.




Appendix A 
NOTE 1

Psychophysical Power Law

$$\psi = k(\phi - \phi_0)^n$$

where:

$\psi$ = Perceived Magnitude (Judged Difficulty)
$k$ = Constant
$\phi$ = Physical Magnitude (Duration of Training)
$\phi_0$ = "Threshold" (Statistical Expedient)
$n$ = Exponent (Unique to Given Modality)




NOTE 2

Ratio Estimation

Example:

S.ENTER = 2+, S.EXIT = 4

$$\frac{\psi_{4}}{\psi_{2+}} = \left(\frac{\phi_{2+\to 4} + \phi_{0\to 2+}}{\phi_{0\to 2+}}\right)^{n}$$

where:

$\psi_{4}$ = Judged Difficulty: S.EXIT = 4
$\psi_{2+}$ = Judged Difficulty: S.ENTER = 2+
$\phi_{2+\to 4}$ = Duration: S.ENTER = 2+; S.EXIT = 4
$\phi_{0\to 2+}$ = Duration: S.ENTER = 0; S.EXIT = 2+
$n$ = 1.03


NOTE 3

Relative Difficulty

Compare:

S.ENTER = 0+; S.EXIT = 3
S.ENTER = 3+; S.EXIT = 4

$$\frac{\psi_{3+\to 4}}{\psi_{0+\to 3}} = \left(\frac{\phi_{3+\to 4}}{\phi_{0+\to 3}}\right)^{n}$$

where:

$\psi_{0+\to 3}$ = Judged Difficulty: S.ENTER = 0+; S.EXIT = 3
$\psi_{3+\to 4}$ = Judged Difficulty: S.ENTER = 3+; S.EXIT = 4
$\phi_{0+\to 3}$ = Duration: S.ENTER = 0+; S.EXIT = 3
$\phi_{3+\to 4}$ = Duration: S.ENTER = 3+; S.EXIT = 4
$n$ = 1.03






Appendix B

Instructions, Phase 1

On the next page is a list of speaking exit proficiency ratings.
Your task is to judge the difficulty you would expect the average LS
student to experience in achieving each rating. You are to express this
difficulty by assigning numbers to the ratings. The first rating, a 2+,
is to be called "10." Thereafter, you are to assign numbers proportional
to your subjective impression of this first rating. For example, if you
feel a particular exit rating is twice as difficult to achieve as a 2+,
assign to it a number "20." If you judge another to be one-fifth as
difficult, call it "2," and so forth. Please do not restrict your
response range. Use numbers as large or as small as you feel are necessary,
including those less than "1" (fractions or decimals) if they are
appropriate. Base your judgments on a specific foreign language with which
you have had extensive teaching experience. Please note at the bottom of
the list which language you had in mind.






SPEAKING
EXIT

2+    10
2
0
1+
2+
3
0+
4+
5
1
4
3+

NAME

DATE

LANGUAGE




Instructions, Phase 2 

On the next page are pairs of speaking enter and exit proficiency
ratings. Your task is to judge how difficult it would be for a typical LS
language student to achieve each exit proficiency score given its paired
enter proficiency score. You are to express this difficulty by assigning
a number to each pair. The first pair of ratings, 1+ and 3, is to be
called "10." Thereafter, you are to assign numbers proportional to your
subjective impression of this first pair of ratings. For example, if you
feel a particular pair of ratings is twice as difficult to achieve as the
1+ and 3 pair, assign to it a number "20." If you judge another pair to
be one-fifth as difficult, call it "2," and so forth. Please do not
restrict your response range. Use numbers as large or as small as you
feel are necessary, including those less than "1" (fractions or decimals)
if they are appropriate. Base all of your judgments on the same foreign
language you chose in Phase 1. Please make note of this language at the
bottom of the list.




SPEAKING

Enter    Exit        Difficulty

1+       3           10
3        4+
0+       3
2+       3+
0        0+
0        1
1+       4
2        3+
1        [illeg.]
2        4
0+       2
2        2+
1        5
3        3+
0        2
2        4+
0+       1+
1+       2
1        1+
2        5
0        3
3+       4+
1+       3+
2+       4+
2[?]     3
0        4
0+       2+
3+       5
2+       4
4        5
3+       4
0        4+
3        4
4        4+
1+       4+
0+       3+
1        4
1+       2+
0        1+
1+       5

(Continued on page 252)

References

Chomsky, N. Language and Mind. New York: Harcourt, Brace, Jovanovich,
1968.

Ekman, G. "A Simple Method for Fitting Psychophysical Power Functions."
Journal of Psychology, 51 (1961): 343-50.

Mashhour, M. "On the Validity of Scales Derived by Ratio and Magnitude
Estimation Methods." Psychological Laboratory, University of
Stockholm. Technical Report No. 105, 1961.

Stevens, S. S. "A Scale for the Measurement of a Psychological Magnitude,
Loudness." Psychological Review, 43 (1936): 405-16.

Stevens, S. S. "The Measurement of Loudness." Journal of the Acoustical
Society of America, 27 (1955): 815-29.

Stevens, S. S. "The Psychophysics of Sensory Function." American
Scientist, 48 (1960): 226-253.

Stevens, S. S. "To Honor Fechner and Repeal His Law." Science, 133
(1961): 80-86.

Stevens, S. S. "A Metric for the Social Consensus." Science, 151 (1966):
530-41.

Torgerson, W. S. Theory and Methods of Scaling. New York: Wiley,
1958.



SETTING STANDARDS OF SPEAKING PROFICIENCY 



Samuel A. Livingston 
Educational Testing Service 




In our society we set standards for all kinds of things. The Food
and Drug Administration sets standards for the purity of food products.
The Environmental Protection Agency sets standards for the cleanliness of
automobile exhaust fumes. And the New Jersey Department of Education
sets standards for the speaking proficiency of teachers--in particular,
teachers of English as a second language (ESL) and teachers of
Spanish-English bilingual classes. A standard is simply an answer to the
question: "How good is good enough?" Any answer to this question must
involve judgment. Therefore, anyone who sets out to do a standard-setting
study must answer four basic questions:

1. What type of judgments will enter into the standard-setting
process?

2. Who will make those judgments? 

3. How will the judgments be collected? 

4. How will the judgments be used to determine the standard? 

The purpose of this paper is to show how each of these four questions was
answered in a standard-setting study conducted for the New Jersey
Department of Education by Educational Testing Service. The Department of
Education uses the Language Proficiency Interview (LPI) as a measure of
speaking proficiency in certifying persons as eligible to teach ESL and
Spanish-English bilingual classes. The standard-setting study was
intended to help the Department decide what interview score level to
establish as the minimum for certification for these teaching positions.

Of the four basic questions listed above, the first question--what
type of judgments to use--is the most basic. In the case of the LPI,
there are at least two ways to answer the question. One way is to use
judgments made on the basis of the written statements that express the
meanings of the various interview score levels. Another way is to use
judgments of the actual interview performances of persons applying for
certification. As the semanticists like to remind us, the word is not the
thing; the written description of performance is not the performance
itself. Therefore, we (that is, researchers from Educational Testing
Service and administrators from the Department of Education) decided
to base the standard-setting on judgments of the actual interview
performances of individual candidates for certification: judgments of each
speaker's proficiency as adequate or not adequate for the job in question
(bilingual or ESL teacher).

The second question--whose judgments to use--depends partly on
the types of judgments to be used. Our main concern was to choose a group
of judges who would be representative of the population of persons
qualified to judge a candidate's speaking proficiency as being adequate
or inadequate for the job of a bilingual or ESL teacher. The Department
of Education recruited three groups of judges, one group for each of
three types of judgment:

1. English-language proficiency for ESL certification 

2. English-language proficiency for bilingual certification 

3. Spanish-language proficiency for bilingual certification 

The judges were all experienced teachers (and in many cases also super- 
visors of teachers) of ESL or Spanish-English bilingual classes. 



Collecting the Data

The third question--how to collect the data--involved a number
of specific decisions. Considerations of scientific method entered
into these decisions, as did administrative considerations. One important
question of research design was how long a segment of each interview
to present to the judges. Since the amount of time the judges could
devote to the study was limited, we had to make a trade-off between two
important considerations: getting a valid judgment of each interview
presented and getting judgments of an adequate number of interviews at
each score level. From a statistical point of view, if the total
listening time is limited, the segments should be of the shortest length
that will allow a meaningful judgment, so as to permit the judging of as
many different interviews as possible. We decided to use five-minute
segments, which enabled us to get judgments of twenty different
interviews. (On the basis of our experience with this study, we now
believe the judges could have made meaningful judgments of segments much
shorter than five minutes.)

A related question is how to select the segment of each interview to
present for judging. Experience with the LPI suggests that the portion
of the interview that yields the most information about the examinee's
strengths and weaknesses begins about thirty seconds after the opening of
the interview. The opening thirty seconds usually consist of conventional
greetings and simple introductory questions. During the following five
minutes--the portion used in the study--the interviewer typically asks
questions aimed at exploring the examinee's command of verb tenses and
ability to communicate on several topics: personal and family background,
personal activities and interests, teaching assignments, classroom
activities, philosophies of education, and so on.

Another important question is the range of score levels to be
represented in the study. Reducing the number of score levels allows
more interviews at each of the remaining levels, but it is important not
to exclude any levels that might turn out to be near the standard.
We eliminated levels 0 and 0+ and level 5, assuming that almost no
level 0 or 0+ interviews would be judged adequate and that most level 5
interviews would be judged adequate. This decision enabled us to present
three interviews at all but one of the remaining seven score levels.
Level 4+ was represented by only two interviews, instead of three.

We decided to use the same sample of English-language interview
segments for both the ESL and English-bilingual judging. This decision
enabled us to make direct comparisons between the ESL and
English-bilingual judgments. It also simplified the data collection
procedure.

To avoid "sequence effects"--systematic trends in the sequence
of score levels of the interview segments that might bias the judgments--
we used the following procedure. First, we divided the twenty interview
segments into three subsamples so that each subsample contained an
interview segment at every score level (with one exception: level 4+ was
not represented in the last subsample). We then randomized the order of
the score levels in each subsample, using a different random sequence for
each subsample. This procedure produced the following sequence of score
levels: 2, 4, 4+, 3, 1+, 2+, 3+, 3, 4+, 3+, 2+, 2, 4, 1+, 2+, 4, 2, 3+,
1+, 3. We used the same sequence for both the English-language interviews
and the Spanish-language interviews.
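
The ordering procedure just described can be sketched in a few lines. This is an illustration only: the function name, the use of Python's `random` module, and the seed argument are assumptions, and nothing here reproduces the particular sequence reported in the study.

```python
import random

# The seven score levels represented in the study.
LEVELS = ["1+", "2", "2+", "3", "3+", "4", "4+"]


def build_sequence(seed=None):
    """Return a 20-item presentation order built from three subsamples.

    Each subsample holds one segment per score level, except that the
    last subsample has no level 4+ segment. Each subsample is shuffled
    independently, then the three are concatenated.
    """
    rng = random.Random(seed)
    subsamples = [
        list(LEVELS),
        list(LEVELS),
        [level for level in LEVELS if level != "4+"],
    ]
    sequence = []
    for subsample in subsamples:
        rng.shuffle(subsample)  # a different random order per subsample
        sequence.extend(subsample)
    return sequence


if __name__ == "__main__":
    print(build_sequence(seed=1))
```

Shuffling within subsamples, rather than across all twenty segments at once, keeps each score level spread through the session, so no level is concentrated at the start or the end.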

The actual judging took place at the language laboratory of Rider
College in Trenton, New Jersey. Eight ESL judges, eleven
English-bilingual judges, and eleven Spanish-bilingual judges
participated. The judges received instructions emphasizing that their
task was to judge whether the speaking proficiency of the person being
interviewed in each segment was "at least minimally sufficient for this
person to function adequately" in the relevant teaching job. The judges
listened to the taped interview segments through earphones at individual
listening booths. They were instructed not to communicate with each other
during the judging process or to give any audible or visible reaction to
the interview segments.



Analysis of the Data 

Our data analysis was intended to take the information contained
in the individual judgments and summarize it in such a way that it would
be as useful as possible for setting standards. Therefore, we tried
to present the results of the judging in a way that would answer the
question: "Given a candidate's interview score, what is the probability
that the candidate's actual speaking proficiency would be judged
acceptable?" Another way to express this question is to ask, "If all
interviews at a given score level were judged by all possible judges, what
percentage of the resulting judgments would rate the candidate as
acceptable?" We sought to answer this question for the English speaking
proficiency of ESL teachers, the English speaking proficiency of
Spanish-English bilingual education teachers, and the Spanish speaking
proficiency of Spanish-English bilingual education teachers.






English as a Second Language. The results of the judging of the
English-language interview segments by the eight ESL judges are shown
in Table 1 and presented graphically in Figure 1. Table 1 shows what
percentage of the judges rated each interview segment acceptable, as well
as the average of these percentages for all the interview segments at each
LPI score level. For example, of the three interview segments at score
level 3, the first was considered acceptable by 25 percent of the judges;
the second, by 88 percent; and the third, by 100 percent. The average of
these three percentages is 71 percent. This average can be interpreted as
an estimate of the probability that a randomly selected level 3 interview
would be rated as acceptable by a judge selected at random from the
population of all possible ESL judges. Note that in Table 1 these
estimates increase steadily from zero at level 1+ to 100 percent at level
4+.

The fact that fourteen of the twenty interview segments were judged
acceptable either by none of the ESL judges or by all of the ESL judges
indicates a high degree of consistency. In fact, for seventeen of the
twenty segments, at least seven of the eight ESL judges were in agreement,
even though they made their judgments independently, without any
communication with each other.
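
The level 3 average and the count of unanimous segments can be reproduced from the Table 1 percentages. This is a sketch with hypothetical names. The percentages are transcribed as printed, and the second level 2 segment, which is not legible in the table, is assumed to be 0 because that is the value consistent with the printed average of 4.

```python
# Percentage of the eight ESL judges accepting each segment (Table 1).
ESL_PERCENT = {
    "1+": [0, 0, 0],
    "2": [0, 0, 12],      # middle value assumed to be 0 (not legible)
    "2+": [0, 100, 12],
    "3": [25, 88, 100],
    "3+": [62, 100, 100],
    "4": [100, 88, 100],
    "4+": [100, 100],     # only two segments at this level
}


def level_average(percentages):
    """Mean acceptance percentage across the segments at one level."""
    return sum(percentages) / len(percentages)


def count_unanimous(table):
    """Number of segments accepted by none or by all of the judges."""
    return sum(1 for segments in table.values()
               for p in segments if p in (0, 100))


if __name__ == "__main__":
    print(round(level_average(ESL_PERCENT["3"])))  # level 3 average
    print(count_unanimous(ESL_PERCENT))            # unanimous segments
```

Averages computed from the printed (already rounded) percentages can differ by a point from the printed averages at some levels, since one judge out of eight is 12.5 percent rather than 12.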

Figure 1 provides a graphic presentation of the information in Table
1. The dots represent the percentages of acceptance for the individual
interview segments. Horizontal lines have been drawn at 0, 50, and 100
percent to make the graph easier to read. For the same reason, vertical
lines have been drawn to connect the dots representing interview segments
at each score level. The average percentage of acceptance at each score
level is indicated by a short horizontal line. Notice that the average
percentage of acceptance rises steadily from level 1+ to level 4+ in such
a way as to suggest a smooth curve. If such a curve were drawn on the
graph, it would cross the dashed line indicating 50 percent acceptance
somewhere between level 2+ and level 3.

English-Bilingual. Table 2 and Figure 2 present the results of
the judging of the English-language tapes by the English-bilingual
judges. These judges also appear to have been quite consistent in their
evaluations (though not quite as consistent as the ESL judges). The
average percentage of acceptance of the English-language interviews is
consistently higher for the English-bilingual judges than for the ESL
judges. This result suggests that the teaching of English as a second
language requires a higher level of English-language speaking proficiency
than does the teaching of bilingual education classes.

The average percentage of acceptance by the English-bilingual
judges (like that by the ESL judges) increases steadily with increasing
score levels, from 21 percent at level 1+ to 100 percent at levels 4 and
4+. A smooth curve connecting these points in Figure 2 would cross the
line representing 50 percent acceptance slightly above score level
2 (rather than between 2+ and 3, as was the case for the ESL judgments).
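
One way to locate where the averages cross 50 percent is to find the pair of adjacent score levels that bracket it. The sketch below does so with straight-line interpolation between the two bracketing averages. That is an assumption made for illustration, since the text speaks only of a smooth curve, and the function name is hypothetical. The averages are those printed in Tables 1 and 2.

```python
LEVELS = ["1+", "2", "2+", "3", "3+", "4", "4+"]
ESL_AVERAGES = [0, 4, 38, 71, 88, 96, 100]           # Table 1
BILINGUAL_AVERAGES = [21, 45, 67, 82, 97, 100, 100]  # Table 2


def bracket_cutoff(levels, averages, cutoff=50.0):
    """Find the adjacent levels whose averages bracket the cutoff.

    Returns (lower_level, upper_level, fraction), where fraction is how
    far from the lower toward the upper level a straight line between
    the two averages reaches the cutoff. Returns None if no pair does.
    """
    for i in range(len(levels) - 1):
        low, high = averages[i], averages[i + 1]
        if low < cutoff <= high:
            fraction = (cutoff - low) / (high - low)
            return levels[i], levels[i + 1], fraction
    return None


if __name__ == "__main__":
    print(bracket_cutoff(LEVELS, ESL_AVERAGES))
    print(bracket_cutoff(LEVELS, BILINGUAL_AVERAGES))
```

For the ESL judges the bracketing levels are 2+ and 3; for the English-bilingual judges they are 2 and 2+, with the crossing lying closer to 2.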



TABLE 1

English as a Second Language
(8 judges)

Interview        Percentage of Judges Accepting Interview Segment
Score Level      Tape 1     Tape 2     Tape 3     Average

1+                  0          0          0          0
2                   0          0         12          4
2+                  0        100         12         38
3                  25         88        100         71
3+                 62        100        100         88
4                 100         88        100         96
4+                100        100                   100




FIGURE 1

Acceptability Judgments for
English as a Second Language

[Figure not reproduced. Vertical axis: percentage of judges accepting,
with reference lines at 0, 50, and 100. Horizontal axis: Interview Score
Level, 1+ through 4+.]




TABLE 2

English Component of Bilingual Education
(11 judges)

Interview        Percentage of Judges Accepting Interview Segment
Score Level      Tape 1     Tape 2     Tape 3     Average

1+                 27         27          9         21
2                   0         73         64         45
2+                 36        100         64         67
3                  55        100         91         82
3+                100        100         91         97
4                 100        100        100        100
4+                100        100                   100

FIGURE 2

Acceptability Judgments for
English Component of Bilingual Education

[Figure not reproduced. Vertical axis: percentage of judges accepting,
with reference lines at 0, 50, and 100. Horizontal axis: Interview Score
Level, 1+ through 4+.]




Spanish-Bilingual. Table 3 and Figure 3 show the results of the
judging of the Spanish-language tapes by the Spanish-bilingual judges.
These judges appear to have been slightly less consistent in their
evaluations than the English-bilingual judges. However, at least ten of
the eleven judges agreed on eleven of the twenty interview segments, and a
clear majority of the judges were in agreement on all but one of the
interview segments.

The results of the judging of the Spanish-language interview segments
differ in one obvious way from the results of the judging of the
English-language segments: the average percentage of acceptance does not
rise steadily from one score level to the next, but shows a somewhat
inconsistent pattern between levels 2+ and 4. These inconsistencies are
probably the result of sampling variability in the small number of
interview segments presented for judging. A desirable approach in such a
situation would be to get judgments of several additional interview
segments at these levels. This approach, however, would require
reconvening the Spanish-bilingual judges for a further judging session.

One way to deal with these fluctuations in the observed data is by 
means of a statistical technique known as "smoothing." The rationale for 
the use of smoothing with these data is the assumption chat if we could 
somehow get judgments of al T possible interviews at each score level, the 
average percentage of acceptance would increase steadily across the scure 
levels. Thus, if a graph similar to Figure 3 were drawn on the basis of 
judgments of all possible interviews, the points representing the average 
percentage of acceptance would follow a smooth rising curve, as they do 
in Figures 1 and 2. The purpose of ismoothing is to provide a statistical 
estimate of that curve on the basis of the available data. This estimated 
curve is shown in Figure 3. Smoothing improves the estimation at each 
scores level by making use of information contained in the data from 
the adjacent score levels. The smoothing formula we used can be stated 
in words as follows: for each score level, the estimated (smoothed) 
percentage of acceptance is given by: 

one-half of the pe'centage of acceptance at that score level, plus 

one-fourth of the percentage of acceptance at the next lower score 
level, plus 

one-fourth of , the percentage of acceptance at the next higher scorf: 
level. 

The smoothed averages are an improvement over the actual observed aver- 
ages, in the sense that they can be expected to provide a better estimate 
of what the averages would have been had the judging session included a 
very large number of interview segments at each score level. 
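
The smoothing rule stated above can be illustrated with a short sketch in Python. The function name and data layout are ours, chosen for illustration; the input values are the actual average percentages as read from Table 3. The two end levels are left without a smoothed value because each has only one neighboring level.

```python
from typing import List, Optional


def smooth(values: List[float]) -> List[Optional[float]]:
    """Three-point weighted smoothing: 1/2 own value + 1/4 of each neighbor.

    The lowest and highest levels have only one neighbor, so they are
    returned as None (no smoothed value is computed for them).
    """
    smoothed: List[Optional[float]] = [None] * len(values)
    for i in range(1, len(values) - 1):
        smoothed[i] = (
            0.5 * values[i] + 0.25 * values[i - 1] + 0.25 * values[i + 1]
        )
    return smoothed


# Actual average percentages of acceptance, as read from Table 3.
levels = ["1+", "2", "2+", "3", "3+", "4", "4+"]
actual = [0, 12, 79, 45, 79, 69, 91]

for level, value in zip(levels, smooth(actual)):
    shown = "n/a" if value is None else f"{value:.0f}"
    print(f"{level:>2}: {shown}")
```

Rounded to whole percentages, the results for levels 2 through 4 are 26, 54, 62, 68, and 77, which rise steadily even though the actual averages do not.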






                              TABLE 3

              Spanish Component of Bilingual Education
                            (11 judges)

               Percentage of Judges Accepting Interview Segment
Interview                                          Average
Score Level    Tape 1    Tape 2    Tape 3     Actual    Smoothed

    1+            0         0         0          0         *
    2             9        27         0         12        26
    2+           73        82        82         79        54
    3           100        36         0         45        62
    3+           91        91        55         79        68
    4            91        27        91         69        77
    4+          100        82                   91         *

*The smoothing formula used does not provide for computation of smoothed values
at the highest and lowest levels.






FIGURE 3: Acceptability Judgments for Spanish Component of Bilingual Education







The smoothed average percentages of acceptance by the Spanish- 
bilingual judges are shown in the last column of Table 3. These 
percentages are lower than the corresponding percentages for the English- 
bilingual interviews at every score level. However, the curve in Figure 3 
crosses the line representing 50 percent acceptance at a point between 
interview score levels 2 and 2+, as is the case for the English-bilingual
judging. These results suggest that the teaching of Spanish-English 
bilingual classes requires a degree of Spanish-language proficiency 
that is at least as high as the degree of English-language proficiency
required, and possibly somewhat higher. 



Setting the Standard 

The research study provides an estimate of the relationship between 
a speaker's interview score and the probability that the speaker's 
proficiency will be judged adequate. It does not tell the decision maker 
how to use this information to set a standard. One way to proceed is to 
set the pass/fail cutoff for interview scores at the point where the 
probability of acceptance equals 50 percent. This choice has a simple 
rationale: speakers with interview scores below the cutoff are more 
likely to be judged unacceptable than they are to be judged acceptable, 
while the reverse is true for speakers with interview scores above the
cutoff. 

Any decision based on less than perfect information involves the 
possibility of error. In the case of a pass/fail decision about a 
speaker whose interview score is known, there are two types of errors: 
passing a speaker who would have been judged inadequate, and failing a 
speaker who would have been judged adequate. The rationale for setting 
the pass/fail cutoff at the score that corresponds to 50 percent accep- 
tance is based on the implicit assumption that these two types of errors 
are equally serious. But what if they are not equally serious? For 
example, what if it is twice as serious an error to pass an inadequate
speaker as to fail an adequate speaker? Obviously, in this case, the
cutoff should be somewhat higher than the score that corresponds to a 50
percent probability of acceptance, but how much higher?

Statistical decision theory (which, at its simplest levels, is 
really common sense expressed in mathematical language) provides the 
following answer: If it is twice as serious an error to pass an
inadequate speaker as to fail an adequate speaker, we can tolerate two
errors of the second kind (failing a person who should pass) for every
error of the first kind (passing a person who should fail). Therefore, we
should raise the cutoff to the interview score level at which there are
twice as many adequate speakers as inadequate speakers. This is the score 
level that corresponds to a two-thirds (or 67 percent) probability of 
acceptance. At any interview score above this cutoff, the adequate 
speakers will outnumber the inadequate speakers by more than two to one, 
so we will do more harm by failing the adequate speakers than by passing 
the inadequate speakers at that score level. At any interview score below
the cutoff, the number of adequate speakers is less than twice the number
of inadequate speakers, so we will do more harm by passing the inadequate 
speakers than by failing the adequate speakers at this level. 
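
The reasoning above generalizes to any ratio of seriousness. If passing an inadequate speaker is r times as serious as failing an adequate one, passing is the less harmful decision whenever the probability of acceptance p satisfies p >= r / (1 + r). The following Python sketch applies this rule to the smoothed percentages from Table 3. The function names are ours, and choosing the lowest reported score level that reaches the threshold (rather than interpolating between levels) is a simplifying assumption.

```python
from typing import Optional, Sequence


def acceptance_threshold(seriousness_ratio: float) -> float:
    """Probability of acceptance required at the cutoff.

    seriousness_ratio = (seriousness of passing an inadequate speaker)
                        / (seriousness of failing an adequate speaker).
    Passing does less harm when p * 1 >= (1 - p) * ratio,
    that is, when p >= ratio / (1 + ratio).
    """
    if seriousness_ratio <= 0:
        raise ValueError("seriousness_ratio must be positive")
    return seriousness_ratio / (1.0 + seriousness_ratio)


def choose_cutoff(
    levels: Sequence[str], percentages: Sequence[float], threshold: float
) -> Optional[str]:
    """Lowest score level whose acceptance percentage reaches the threshold.

    Assumes the percentages rise steadily across the levels given.
    """
    for level, pct in zip(levels, percentages):
        if pct / 100.0 >= threshold:
            return level
    return None


# Smoothed percentages from Table 3 (levels 1+ and 4+ have no smoothed value).
levels = ["2", "2+", "3", "3+", "4"]
smoothed = [26, 54, 62, 68, 77]

print(choose_cutoff(levels, smoothed, acceptance_threshold(1.0)))  # equal seriousness
print(choose_cutoff(levels, smoothed, acceptance_threshold(2.0)))  # two-to-one seriousness
```

With equal seriousness the threshold is 50 percent and the sketch selects level 2+; with a two-to-one ratio the threshold is two-thirds and it selects level 3+.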

The standard-setting process, therefore, involves two kinds of
judgment. The first is the judgment of speakers' proficiency as adequate
or inadequate. The second is the judgment of the relative seriousness of
the two types of possible errors. These two kinds of judgment do not 
have to be made by the same persons, and often they will not be, since 
different kinds of competence are involved. The first kind of judgment 
requires the ability to recognize adequate and inadequate performance; the
second requires the ability to evaluate the consequences of adequate and 
inadequate performance. 

Summary 

The New Jersey LPI study is an example of a more general procedure 
for conducting an empirical standard-setting study. This general pro- 
cedure can be described as follows: 

1. Determine the measure of performance for which the standard is
to be set. In general terms we can call this measure the test score. In
the New Jersey study it was the Language Proficiency Interview score.

2. Determine the type of performance that will serve as the basis
for judging a person's proficiency as adequate or inadequate. In general
terms we would call this performance the criterion performance. The
criterion performance in the New Jersey LPI study was a portion of the
interview itself.

3. Identify a population of persons qualified to judge examples
of the criterion performance as adequate or inadequate. Select a sample
of these persons to serve as judges.

4. Identify the population of persons taking the test for which a
standard is to be set and obtain their test scores. Select a sample of
these examinees, making sure the range of their test scores is broad
enough to include both the lowest and the highest scores that might
conceivably be selected as the standard.

5. Obtain judgments of the examinees' criterion performances by
the judges.

6. Analyze the data provided by these judgments to estimate
the probability that an examinee's criterion performance will be judged
adequate, as a function of the examinee's test score.

These six steps make up the empirical study. Two remaining steps complete 
the standard-setting procedure. 






7. Determine the relative seriousness of the two types of possible
errors: passing an examinee whose criterion performance is inadequate and
failing an examinee whose criterion performance is adequate.

8. Set the standard at the test score level that results in an
equal risk of the two types of possible errors, weighted by their
seriousness in the particular decision-making situation for which a standard is
to be set. 
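
Step 6 of the procedure can be sketched in Python as a simple pooling of the judges' votes at each test score. The function name and record layout are ours. The vote counts shown are an assumption: they are back-calculated from the Table 3 percentages for levels 2+ and 3 with 11 judges, not taken from the original judging records.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def acceptance_by_score(
    judgments: Iterable[Tuple[str, int, int]]
) -> Dict[str, float]:
    """Proportion of 'adequate' votes at each test score.

    Each record is (test_score, votes_adequate, votes_total) for one
    examinee's criterion performance.
    """
    adequate: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for score, votes_adequate, votes_total in judgments:
        adequate[score] += votes_adequate
        total[score] += votes_total
    return {score: adequate[score] / total[score] for score in total}


# Assumed vote counts, inferred from the Table 3 percentages (11 judges per tape).
records = [
    ("2+", 8, 11), ("2+", 9, 11), ("2+", 9, 11),
    ("3", 11, 11), ("3", 4, 11), ("3", 0, 11),
]

for score, proportion in acceptance_by_score(records).items():
    print(score, round(100 * proportion))
```

Under these assumed counts the pooled percentages are 79 for level 2+ and 45 for level 3, which agree with the actual averages in Table 3. When every segment is judged by the same number of judges, pooling votes in this way is equivalent to averaging the per-segment percentages.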



